With the growing presence of technology in our society, there is a rapidly increasing demand for hardware that supports our computational needs. One of the most important pieces of hardware for these computationally intensive tasks is the graphics processing unit (GPU), owing to its ability to handle parallel workloads with ease; this has made GPUs an invaluable resource for companies pursuing any sort of artificial intelligence (AI), supercomputing, cryptocurrency mining, or computer graphics. For example, most individuals are likely familiar with companies such as Meta, Snapchat, and Amazon, which are at the forefront of the technology industry; while these stocks are blue chips in any investor’s portfolio, one should recognize that the performance of each of these companies largely depends on its ability to process large amounts of user data (i.e. supercomputing). Consequently, companies that manufacture processing chips have been a focal point for investors in recent years, as they directly control the output capabilities of a large portion of the technology sector.
The purpose of this project is to predict the price trends of a semiconductor stock; in fact, there are four stocks we wish to predict: Nvidia, Advanced Micro Devices (AMD), Intel Corporation, and Taiwan Semiconductor Manufacturing. Though tackling four separate stocks deviates slightly from the project guideline of applying a variety of machine learning algorithms to a single data set, the hope is that this will give a better indication of which machine learning models are most useful for stock market prediction (in the case of semiconductor stocks). Put differently, if we only applied our statistical learning models to a single stock and found an optimal model \(M\), this would naturally raise the question of whether \(M\) fits the stock best because it is optimal for stock market prediction in general, or because of that particular stock’s characteristics. Therefore, for each of the four chip manufacturers listed above, we will apply a variety of statistical learning models, ranging from standard regression to more non-linear models such as random forests and k-nearest neighbors.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(discrim)
library(ggthemes)
library(kableExtra)
library(yardstick)
library(visdat)
library(scales)
library(glmnet)
library(quantmod)
library(dygraphs)
library(tidyr)
library(ranger)
library(xgboost)
library(vip)
tidymodels_prefer()
conflicted::conflicts_prefer(yardstick::rsq)
conflicted::conflicts_prefer(dplyr::lag)
set.seed(3435)
One of the most useful packages for the analysis of stock market data
in R is the quantmod package. While it offers several
built-in functions and tools which predict stock market trends using
statistical learning, we will use this package only for the purpose of
pulling live stock data from Yahoo
Finance and visualizing the data in a much neater fashion than
ggplot2 usually allows (as using built-in statistical learning methods in
a statistical learning project would be somewhat poor form).
In particular, the quantmod package allows us to
circumvent the process of downloading a CSV file of stock market data
from the internet and read it into R by instead using the
loadSymbols method, which then automatically loads all
stock data (over a potentially specified interval) into a table assigned
the same name as the stock’s symbol. Since the function
loadSymbols will automatically set the most recent day to
pull stock data as the current day, we must specify a “last day” to pull
our stock data to ensure that our models run the same every day; thus,
we will pull one fiscal year’s worth of stock data
ranging from January \(1^{\text{st}}\),
2023 to January \(1^{\text{st}}\),
2024. It is worth noting that the table loaded by
loadSymbols is not a data-frame; its
original datatype must be used for other functions in the
quantmod package such as the chartSeries plots
applied below, but we will later need to typecast this object to a usual
data-frame in order to apply most of the tidyverse
functions.
Below are formatted plots of the market data for our four stocks
(AMD, NVDA, INTC, and TSM) obtained using the chartSeries
function. We will postpone the discussion of what several terms mean
until the Exploratory Data Analysis section.
a <- loadSymbols("AMD", from = "2023-01-01", to = "2024-01-01")
chartSeries(AMD, TA = c(addVo(), addBBands(), addMACD()))
my_dates <- index(AMD)
AMD <- data.frame(AMD)
AMD$AMD.Date <- my_dates
a <- loadSymbols("NVDA", from = "2023-01-01", to = "2024-01-01")
chartSeries(NVDA, TA = c(addVo(), addBBands(), addMACD()))
NVDA <- data.frame(NVDA)
NVDA$NVDA.Date <- my_dates
a <- loadSymbols("INTC", from = "2023-01-01", to = "2024-01-01")
chartSeries(INTC, TA = c(addVo(), addBBands(), addMACD()))
INTC <- data.frame(INTC)
INTC$INTC.Date <- my_dates
a <- loadSymbols("TSM", from = "2023-01-01", to = "2024-01-01")
chartSeries(TSM, TA = c(addVo(), addBBands(), addMACD()))
TSM <- data.frame(TSM)
TSM$TSM.Date <- my_dates
With each stock’s data loaded and typecast into a data-frame, we also
want to examine the total number of predictors and observations for each
individual stock. As the beginning and ending dates specified in
loadSymbols are the same across our four stocks, it
suffices to check the dimensions of only one of our data frames:
dim(AMD)
## [1] 250 7
One may initially be misled by the fact that only 250 observations appear across a year’s worth of data; i.e. we only have 250 out of 365 days. However, this simply reflects the fact that the New York Stock Exchange does not operate on weekends and a handful of holidays. In addition, a quick analysis shows that there is no missing data among any of the four stocks; this is somewhat expected, since stock market data is meant to be as publicly available as possible and the original features are fairly common metrics for financial institutions to collect.
vis_miss(AMD)
vis_miss(NVDA)
vis_miss(INTC)
vis_miss(TSM)
With our four stocks pulled and loaded into data frames, we first wish to examine the relevance of each predictor initially provided, and then examine other possible metrics by which to measure the performance of a given stock. Since measuring the performance of a financial security over time relies not only on historical data, but on historical data whose relevance may depreciate depending on how recent it is (or isn’t), much of the rationale behind the predictors added in the latter half of this section comes from the study of time series in statistics. However, since the focus of this project is on statistical learning models and how they are applied to certain problems, several important techniques in the study of time series (and thus stock-market analysis) will not be applied within this project.
In the first half of this section, we examine the initial
predictors loaded into the data frames for our stocks. For each of
the four stock symbols (AMD, NVDA, INTC, and TSM), we only consider the
term following the stock symbol and the delimiter ".":

- Open: the opening price of the stock that day (specifically, the value of the stock at 9:30 AM EST).
- Close: the closing price of the stock that day (specifically, the value of the stock at 4:00 PM EST).
- Low: the minimum valuation of the stock over the given day.
- High: the maximum valuation of the stock over the given day.
- Volume: the total number of shares of the stock traded over the given day (i.e. both bought and sold).
- Adjusted: the closing price of the stock that day together with any adjustments due to corporate actions such as rights offerings, dividends, and splits.

The dygraphs package loaded in the previous section
provides a compact and convenient way for us to visualize several of the
above predictors simultaneously for each stock:
p <- dygraph(AMD[,c(1:4)], xlab = "Date", ylab = "Price", main = "AMD Price") %>%
dySeries("AMD.Open", label = "Open", color = "black") %>%
dySeries("AMD.Low", label = "Low", color = "red") %>%
dySeries("AMD.High", label = "High", color = "green") %>%
dySeries("AMD.Close", label = "Close", color = "orange") %>%
dyRangeSelector() %>%
dyCandlestick()%>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 3, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = T) %>%
dyRoller(rollPeriod = 1)
p
p <- dygraph(NVDA[,c(1:4)], xlab = "Date", ylab = "Price", main = "NVDA Price") %>%
dySeries("NVDA.Open", label = "Open", color = "black") %>%
dySeries("NVDA.Low", label = "Low", color = "red") %>%
dySeries("NVDA.High", label = "High", color = "green") %>%
dySeries("NVDA.Close", label = "Close", color = "orange") %>%
dyRangeSelector() %>%
dyCandlestick()%>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 3, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = T) %>%
dyRoller(rollPeriod = 1)
p
p <- dygraph(INTC[,c(1:4)], xlab = "Date", ylab = "Price", main = "INTC Price") %>%
dySeries("INTC.Open", label = "Open", color = "black") %>%
dySeries("INTC.Low", label = "Low", color = "red") %>%
dySeries("INTC.High", label = "High", color = "green") %>%
dySeries("INTC.Close", label = "Close", color = "orange") %>%
dyRangeSelector() %>%
dyCandlestick()%>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 3, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = T) %>%
dyRoller(rollPeriod = 1)
p
p <- dygraph(TSM[,c(1:4)], xlab = "Date", ylab = "Price", main = "TSM Price") %>%
dySeries("TSM.Open", label = "Open", color = "black") %>%
dySeries("TSM.Low", label = "Low", color = "red") %>%
dySeries("TSM.High", label = "High", color = "green") %>%
dySeries("TSM.Close", label = "Close", color = "orange") %>%
dyRangeSelector() %>%
dyCandlestick()%>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 3, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = T) %>%
dyRoller(rollPeriod = 1)
p
While the six predictors pulled from Yahoo Finance give significant insight into each stock’s historical performance over the year, there may be other, more useful metrics that we can use to assess the future growth of our securities. For example, we are attempting to predict data which relies on its past/recent behavior; however, all of the variables in our data frame are synchronized to measure the same 24-hour window. This issue naturally leads to the study of time series in statistics, which will ultimately motivate several of the variables we introduce. There are, however, other metrics we will introduce that are much more specific to the study of stock market data rather than time series as a whole.
As mentioned above, the analysis of stock market data is a problem which falls into the branch of statistics known as time-series. Ultimately, the study of time-series is a vast and rich topic that we cannot hope to even begin covering within the scope of this project; however, we will attempt to provide a rudimentary explanation of the necessary topics for the sake of completeness.
Foremost, a time series is a collection of data points \(\{ X_t : t \in T \}\) gathered over some time domain \(T \subseteq [0, \infty)\) and ordered chronologically; while it is not necessarily intrinsic to general time series, many examples have the property that the value of a given data point \(X_{t_0}\) depends on previous historical data \(\{ X_t : t < t_0 \}\). For these types of examples, whenever we are given a sample of data points \(\{ X_1, X_2, \dots, X_n \}\) ordered chronologically, we refer to \(X_{i-1}, X_{i-2}, \dots, X_{i-p}\) as the lag variables associated to \(X_i\). Conveniently, this allows us to model \(X_i\) as a function of its lag variables; for example, autoregression is simply the study of regression techniques applied to the lag variables (and potentially other predictors) of a time series.
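To make the idea of autoregression concrete, here is a small self-contained sketch on simulated data (not the project’s stock data, and not a model we will tune later): we regress a series on its own first two lags with lm(), exactly as described above.

```r
# Toy autoregression sketch (simulated data, not part of the project):
# fit an AR(2) model by regressing the series on its first two lags.
library(dplyr)
library(tidyr)

set.seed(123)
x <- as.numeric(arima.sim(model = list(ar = c(0.6, 0.2)), n = 500))

df <- tibble(x = x,
             x_l1 = dplyr::lag(x, 1),   # lag variables of x
             x_l2 = dplyr::lag(x, 2)) %>%
  drop_na()                             # drop rows where the lags are undefined

fit <- lm(x ~ x_l1 + x_l2, data = df)
coef(fit)  # slope estimates should be near the true AR coefficients 0.6 and 0.2
```

With 500 observations, the estimated slopes land close to the coefficients used to simulate the series, which is exactly what autoregression is exploiting.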
As many of the computations in R are done via vector operations
(i.e. on the columns of a data frame), defining a lag variable is
effectively synonymous with shifting the values in a given vector “down”.
While this would likely require some sort of for-loop
in other languages, R provides the useful lag
method to do exactly this. However, there is one shortcoming that arises
regardless of the language used: how do we define the lag variable for
the first observations? In R, since we are shifting a given vector down
by \(p\) entries, the first \(p\) entries of the lag variable are
filled in with NA, which can ultimately cause some issues
for numerical computations later down the road. One common fix for this
is to simply pretend the series was constant prior to the first
observation, and set the missing lag values equal to the first observed
value \(X_1\) — this can be done in R
using the fill command. It should be noted that assuming
the series is constant prior to the first observation is a fairly
rudimentary solution, and there are more advanced techniques which may
give better accuracy.
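As a minimal illustration of the lag-and-fill pattern just described, applied to a toy vector rather than the stock data:

```r
# Toy illustration: define a lag variable, then backfill its leading NAs
# (i.e. pretend the series was constant before the first observation).
library(dplyr)
library(tidyr)

df <- tibble(x = c(10, 11, 13, 12, 15))

df <- df %>%
  mutate(x_l2 = dplyr::lag(x, 2)) %>%   # first 2 entries become NA
  fill(x_l2, .direction = "up")         # backfill NAs with the first observed value

df$x_l2  # 10 10 10 11 13
```

Note that `.direction = "up"` fills each NA from the next non-missing value below it, which is what makes the pre-sample lag values equal to the first observation.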
# Define lag variables for AMD (1 through 10 trading days)
for (k in 1:10) {
  AMD[[paste0("AMD.Close_L", k)]] <- lag(AMD$AMD.Close, k)
}
# Fill in NA entries at the beginning of each lag variable
AMD <- AMD %>%
  fill(starts_with("AMD.Close_L"), .direction = "up")
# Define lag variables for NVDA (1 through 10 trading days)
for (k in 1:10) {
  NVDA[[paste0("NVDA.Close_L", k)]] <- lag(NVDA$NVDA.Close, k)
}
# Fill in NA entries
NVDA <- NVDA %>%
  fill(starts_with("NVDA.Close_L"), .direction = "up")
# Define lag variables for INTC (1 through 10 trading days)
for (k in 1:10) {
  INTC[[paste0("INTC.Close_L", k)]] <- lag(INTC$INTC.Close, k)
}
# Fill in NA entries
INTC <- INTC %>%
  fill(starts_with("INTC.Close_L"), .direction = "up")
# Define lag variables for TSM (1 through 10 trading days)
for (k in 1:10) {
  TSM[[paste0("TSM.Close_L", k)]] <- lag(TSM$TSM.Close, k)
}
# Fill in NA entries
TSM <- TSM %>%
  fill(starts_with("TSM.Close_L"), .direction = "up")
Any time one utilizes lag variables in a time-series problem, a natural question arises: “how far back should we look to model current data?” In other words, what is the optimal lag length \(p\) for a given time series? It turns out this is an incredibly involved problem that does not have a single universal answer for all time-series questions; there are several advanced techniques guided by economic theory for autoregressive models, such as the F-test approach and the Bayes information criterion (BIC) / Akaike information criterion (AIC), but we will unfortunately omit any discussion of these techniques since the focus of this project is on the statistical learning models themselves. Instead, we will naively take the lag length to be two weeks (i.e. 10 business days), which is fairly short for economic time-series problems measured on a daily scale.
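Though we will not use it for our stocks, a quick sketch of AIC-based lag-length selection is worth seeing; the base-R ar() function fits autoregressive models of increasing order and keeps the order minimizing the AIC. (This runs on simulated data, not the project’s data.)

```r
# Sketch of AIC-based lag-length selection on a simulated AR(2) series.
set.seed(3435)
x <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.3)), n = 300))

fit <- ar(x, order.max = 10, aic = TRUE)  # fits AR(0)..AR(10), keeps the best AIC
fit$order                                 # the lag length p selected by AIC
```

In a fuller treatment, the chosen order would replace our fixed choice of 10 business days.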
One issue that immediately arises after defining lag variables is that they are almost always highly correlated with one another. Ultimately this makes sense, as they are in fact measuring the same continuous observable; however, adding lag variables often results in a significant amount of multicollinearity within our model. One way to eliminate this issue is to instead measure the average of our lag variables across a potentially different lag length, known as the simple moving average (SMA). Specifically, we define \[ \operatorname{SMA}(X_i, k) = \frac{1}{k}\sum_{j=0}^{k-1}X_{i-j} \] (notice this is inclusive of the current data point, but we could alternatively define the simple moving average to run over \(1 \leq j \leq k\) to specifically account for prior data).
simple_moving_average <- function(my_vec, lag_period) {
#' Takes the running average of a column vector
#'
#' Creates a new column vector whose entries are the average of the previous lag_period entries.
#' When not enough data is available to take the average over lag_period, the closest possible
#' average will be taken (for example, if lag_period = 10, then the 2nd entry of the output
#' vector will simply be the average of the first two values, the 3rd entry of the output
#' vector will be the average of the first three values, and so forth.)
#'
#' @param my_vec the column vector to take the average values of
#' @param lag_period the number of days one wishes to average over
#'
#' @return A vector whose entries represent the average of the previous lag_period entries in my_vec
# Error handling
if(is.vector(my_vec) == FALSE){
stop("Not Vector: First argument of simple_moving_average must be a vector")
}
if(is.numeric(my_vec[1]) == FALSE){
stop("Non-numeric Entries: values of vector in first argument must be numeric.")
}
if(is.numeric(lag_period) == FALSE || lag_period != round(lag_period)){
stop("Not Integer: Second argument of simple_moving_average must be an integer larger than or equal to 2")
}
if(lag_period <= 1){
stop("Not Large Enough: Second argument of simple_moving_average must be an integer larger than or equal to 2")
}
# return variable
output_vec = c()
for (i in 1:length(my_vec)) {
# If there are fewer than lag_period data points prior to the current date,
# simply take the average of all the days so far to get the closest thing
# to a running average
if (i <= lag_period){
output_vec[i] = mean(my_vec[1:i])
}
else {
output_vec[i] = mean(my_vec[(i-lag_period + 1):i])
}
}
return(output_vec)
}
# Obtain four simple moving averages of AMD
AMD$AMD.SMA_cl_1W <- simple_moving_average(AMD$AMD.Close, 5)
AMD$AMD.SMA_cl_2W <- simple_moving_average(AMD$AMD.Close, 10)
AMD$AMD.SMA_cl_1M <- simple_moving_average(AMD$AMD.Close, 20)
AMD$AMD.SMA_cl_2M <- simple_moving_average(AMD$AMD.Close, 40)
# Obtain four simple moving averages of NVDA
NVDA$NVDA.SMA_cl_1W <- simple_moving_average(NVDA$NVDA.Close, 5)
NVDA$NVDA.SMA_cl_2W <- simple_moving_average(NVDA$NVDA.Close, 10)
NVDA$NVDA.SMA_cl_1M <- simple_moving_average(NVDA$NVDA.Close, 20)
NVDA$NVDA.SMA_cl_2M <- simple_moving_average(NVDA$NVDA.Close, 40)
# Obtain four simple moving averages of INTC
INTC$INTC.SMA_cl_1W <- simple_moving_average(INTC$INTC.Close, 5)
INTC$INTC.SMA_cl_2W <- simple_moving_average(INTC$INTC.Close, 10)
INTC$INTC.SMA_cl_1M <- simple_moving_average(INTC$INTC.Close, 20)
INTC$INTC.SMA_cl_2M <- simple_moving_average(INTC$INTC.Close, 40)
# Obtain four simple moving averages of TSM
TSM$TSM.SMA_cl_1W <- simple_moving_average(TSM$TSM.Close, 5)
TSM$TSM.SMA_cl_2W <- simple_moving_average(TSM$TSM.Close, 10)
TSM$TSM.SMA_cl_1M <- simple_moving_average(TSM$TSM.Close, 20)
TSM$TSM.SMA_cl_2M <- simple_moving_average(TSM$TSM.Close, 40)
# Plot the simple moving averages for each stock
ggplot(data = AMD, aes(x=AMD.Date)) +
geom_line(aes(y = AMD.SMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = AMD.SMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = AMD.SMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = AMD.SMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(AMD) Simple Moving Average") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = NVDA, aes(x=NVDA.Date)) +
geom_line(aes(y = NVDA.SMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = NVDA.SMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = NVDA.SMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = NVDA.SMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(NVDA) Simple Moving Average") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = INTC, aes(x=INTC.Date)) +
geom_line(aes(y = INTC.SMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = INTC.SMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = INTC.SMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = INTC.SMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(INTC) Simple Moving Average") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = TSM, aes(x=TSM.Date)) +
geom_line(aes(y = TSM.SMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = TSM.SMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = TSM.SMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = TSM.SMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(TSM) Simple Moving Average") +
scale_y_continuous(labels = label_comma()) +
theme_dark()
One characteristic that immediately becomes apparent is that plotting the simple moving averages instead of the closing prices seems to “smooth out” the curves — in other words, the simple moving average is much more stable and is not affected by a share’s volatility as much as our original predictors. Mathematically, this is simply a consequence of the fact that we are slowly interpolating the data with the overall average; since the overall average is a constant function (and thus infinitely differentiable), the “smoothing out” process is really just interpolation with a \(C^\infty(\mathbb{R})\) function.
Taking different values for the “lag period” of the simple moving average also has several benefits. Since the simple moving average measures the trends of a stock over its lag period, the SMA taken over 2 months (i.e. roughly 40 business days) measures long-term behavior much more accurately than the SMA taken over 2 weeks (i.e. roughly 10 business days), which is in turn a much better measure of short-term trends. In fact, two SMA curves crossing tends to be a common trading signal for investors — when a short-period SMA curve crosses below a long-period SMA curve, this indicates the recent behavior of the stock is worse than its long-term trend, meaning an investor may want to short the stock (i.e. invest based on the assumption that the value of the stock will decrease). Similarly, when a short-period SMA curve crosses above a long-period SMA curve, this should indicate to investors that the stock is performing better than usual and thus it may be a good idea to go long (i.e. purchase the stock).
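The crossover signal described above is easy to detect programmatically. Below is a small illustrative sketch — the crossover_signals helper is our own construction for this illustration, not a quantmod function — that flags the days on which a short-period SMA crosses a long-period SMA.

```r
# Illustrative helper (not from quantmod): flag SMA crossover days.
# A "golden cross" is the short SMA crossing above the long SMA (go long);
# a "death cross" is the short SMA crossing below it (go short).
crossover_signals <- function(sma_short, sma_long) {
  rel <- sign(sma_short - sma_long)  # +1 when the short SMA is above the long SMA
  change <- c(0, diff(rel))          # nonzero exactly when the curves cross
  ifelse(change > 0, "golden cross",
         ifelse(change < 0, "death cross", NA))
}

# Toy example: the short SMA overtakes the long SMA on day 3.
signals <- crossover_signals(c(1, 2, 4, 5), c(3, 3, 3, 3))
signals  # NA NA "golden cross" NA
```

On real data one would pass, e.g., a 2-week and a 2-month SMA column; days where the two averages are exactly equal would need slightly more careful handling than this sketch provides.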
While the simple moving average (SMA) above can be incredibly useful for de-correlating the lag variables and measuring the trends and behaviors of a given security, it can fall short in one particular way: it gives equal weight to events that happened at the beginning of the lag period and events that happened at the end of the lag period. This is generally not the case with time-series problems, as the relevance of data points tends to decay non-linearly over time. A better model is to weight recent events more heavily than events further in the past, and then take the average of these \(\text{values} \times \text{weights}\) — this is precisely the idea behind the exponential moving average. Specifically, we fix some smoothing factor \(\beta > 0\), define our weight to be \(w = \beta\, /\, (n+1)\), and define the exponential moving average as \[ \operatorname{EMA}(X_i, n) = w \cdot X_{i}\ +\ (1-w) \cdot\operatorname{EMA}(X_{i-1}, n) \] where \(n\) is the number of days in our lag period. It is common in finance to take \(\beta = 2\), which is precisely the value we will take for the purpose of this project.
exponential_moving_average_helper <- function(my_vec, lag_period, smoothing_factor) {
#' Helper function to evaluate the exponential moving average over a fixed period
#'
#' Recursively applies EMA(X_i) = w * X_i + (1 - w) * EMA(X_{i-1}), seeded with
#' the first entry of my_vec, and returns the exponential moving average at the
#' final entry of my_vec.
#'
#' @param my_vec the column vector of values to average
#' @param lag_period the number of days being averaged over (used for validation)
#' @param smoothing_factor the weight w applied to the most recent value
#'
#' @return The exponential moving average at the final entry of my_vec
# Error handling
if(is.vector(my_vec) == FALSE){
stop("Not Vector: First argument of exponential_moving_average_helper must be a vector")
}
if(is.numeric(my_vec[1]) == FALSE){
stop("Non-numeric Entries: values of vector in first argument must be numeric.")
}
if(is.numeric(lag_period) == FALSE || lag_period != round(lag_period)){
stop("Not Integer: Second argument of exponential_moving_average_helper must be an integer larger than or equal to 2")
}
if(lag_period <= 1){
stop("Not Large Enough: Second argument of exponential_moving_average_helper must be an integer larger than or equal to 2")
}
vec_len <- length(my_vec)
output_vec <- c()
output_vec[1] <- my_vec[1]
for (i in 2:vec_len) {
output_vec[i] <- smoothing_factor * my_vec[i] + (1 - smoothing_factor) * output_vec[i-1]
}
return(output_vec[vec_len])
}
exponential_moving_average <- function(my_vec, lag_period) {
#' Takes the exponential moving average of a column vector
#'
#' Creates a new column vector whose entries are the exponential moving average of the
#' previous lag_period entries. When not enough data is available to average over
#' lag_period entries, the average is taken over all available prior entries instead.
#'
#' @param my_vec the column vector to take the average values of
#' @param lag_period the number of days one wishes to average over
#'
#' @return A vector whose entries represent the exponential moving average of the previous lag_period entries in my_vec
# Error handling
if(is.vector(my_vec) == FALSE){
stop("Not Vector: First argument of exponential_moving_average must be a vector")
}
if(is.numeric(my_vec[1]) == FALSE){
stop("Non-numeric Entries: values of vector in first argument must be numeric.")
}
if(is.numeric(lag_period) == FALSE || lag_period != round(lag_period)){
stop("Not Integer: Second argument of exponential_moving_average must be an integer larger than or equal to 2")
}
if(lag_period <= 1){
stop("Not Large Enough: Second argument of exponential_moving_average must be an integer larger than or equal to 2")
}
output_vec <- c()
output_vec[1] <- my_vec[1]
smoothing_factor = 2/(lag_period+1)
for (i in 2:length(my_vec)) {
# If there are fewer than lag_period data points prior to the current date,
# simply take the average of all the days so far to get the closest thing
# to a running average
if (i <= lag_period){
output_vec[i] = exponential_moving_average_helper(my_vec[1:i], i, 2/(i + 1))
}
else {
output_vec[i] = exponential_moving_average_helper(my_vec[(i-lag_period + 1):i], lag_period, smoothing_factor)
}
}
return(output_vec)
}
# Compute Exponential moving average for AMD
AMD$AMD.EMA_cl_1W <- exponential_moving_average(AMD$AMD.Close, 5)
AMD$AMD.EMA_cl_2W <- exponential_moving_average(AMD$AMD.Close, 10)
AMD$AMD.EMA_cl_1M <- exponential_moving_average(AMD$AMD.Close, 20)
AMD$AMD.EMA_cl_2M <- exponential_moving_average(AMD$AMD.Close, 40)
# Compute Exponential moving average for NVDA
NVDA$NVDA.EMA_cl_1W <- exponential_moving_average(NVDA$NVDA.Close, 5)
NVDA$NVDA.EMA_cl_2W <- exponential_moving_average(NVDA$NVDA.Close, 10)
NVDA$NVDA.EMA_cl_1M <- exponential_moving_average(NVDA$NVDA.Close, 20)
NVDA$NVDA.EMA_cl_2M <- exponential_moving_average(NVDA$NVDA.Close, 40)
# Compute Exponential moving average for INTC
INTC$INTC.EMA_cl_1W <- exponential_moving_average(INTC$INTC.Close, 5)
INTC$INTC.EMA_cl_2W <- exponential_moving_average(INTC$INTC.Close, 10)
INTC$INTC.EMA_cl_1M <- exponential_moving_average(INTC$INTC.Close, 20)
INTC$INTC.EMA_cl_2M <- exponential_moving_average(INTC$INTC.Close, 40)
# Compute Exponential moving average for TSM
TSM$TSM.EMA_cl_1W <- exponential_moving_average(TSM$TSM.Close, 5)
TSM$TSM.EMA_cl_2W <- exponential_moving_average(TSM$TSM.Close, 10)
TSM$TSM.EMA_cl_1M <- exponential_moving_average(TSM$TSM.Close, 20)
TSM$TSM.EMA_cl_2M <- exponential_moving_average(TSM$TSM.Close, 40)
# Plot the exponential moving average for each stock
ggplot(data = AMD, aes(x=AMD.Date)) +
geom_line(aes(y = AMD.EMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = AMD.EMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = AMD.EMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = AMD.EMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(AMD) Exponential Moving Average") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = NVDA, aes(x=NVDA.Date)) +
geom_line(aes(y = NVDA.EMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = NVDA.EMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = NVDA.EMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = NVDA.EMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(NVDA) Exponential Moving Average") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = INTC, aes(x=INTC.Date)) +
geom_line(aes(y = INTC.EMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = INTC.EMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = INTC.EMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = INTC.EMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(INTC) Exponential Moving Average") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = TSM, aes(x=TSM.Date)) +
geom_line(aes(y = TSM.EMA_cl_1W, color = '1-Week')) +
geom_line(aes(y = TSM.EMA_cl_2W, color = '2-Week')) +
geom_line(aes(y = TSM.EMA_cl_1M, color = '1-Month')) +
geom_line(aes(y = TSM.EMA_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(TSM) Exponential Moving Average") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
While the plots of the simple moving average and the exponential moving average look very similar to one another, one notable difference is that the exponential moving average does not “smooth out” the curve as much; this is simply because recent volatility affects the exponential moving average much more than the simple moving average, since recent observations are weighted more heavily.
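The difference in responsiveness can be seen on a toy step-change series. The sketch below (base R, hypothetical data, assuming the common smoothing factor \(\alpha = 2/(n+1)\), which may differ from the convention in the exponential_moving_average function used above) shows the EMA moving further toward the new level than the SMA one step after the jump.

```r
# Toy series: flat at 10, then a step up to 20 at t = 11.
x <- c(rep(10, 10), rep(20, 10))
n <- 5
alpha <- 2 / (n + 1)  # common EMA smoothing factor (assumed convention)

# Simple moving average over the last n points (full windows only)
sma <- sapply(n:length(x), function(i) mean(x[(i - n + 1):i]))

# Recursive EMA: ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}
ema <- numeric(length(x))
ema[1] <- x[1]
for (i in 2:length(x)) ema[i] <- alpha * x[i] + (1 - alpha) * ema[i - 1]

# One step after the jump (t = 11), the EMA has moved further toward 20
# than the SMA, since it weights the new observation more heavily
sma_t11 <- sma[11 - n + 1]  # SMA at t = 11 (window covers t = 7..11)
ema_t11 <- ema[11]
```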
With a concrete notion of the simple moving average as the mean taken over our lag variables, a natural extension is to also measure the standard deviation of our lag variables to gain accurate insight into the volatility of our time series. In fact, the simple moving deviation underlies a very particular construction for stock prices: Bollinger Bands, which plot the simple moving average plus and minus a multiple (typically two) of the moving standard deviation. In the very first graphs plotted at the beginning of this analysis (in the Loading Packages and Data chapter) using the chartSeries command, the gray-dotted line running through the stock values is simply the 20-day SMA, and the red-dotted lines above and below it are the Bollinger bands derived from the 20-day simple moving deviation. For the sake of completeness, the simple moving deviation can be calculated as \[
\operatorname{SMD}(X_i, n) = \sqrt{ \frac{\sum_{j=0}^{n-1} (X_{i-j} -
\operatorname{SMA}(X_i, n))^2}{n-1} }
\] As with the simple moving average, we could alternatively
consider the variation where the current data point is not accounted for
in the simple moving deviation by instead ranging \(1 \leq j \leq n\).
simple_moving_deviation <- function(my_vec, lag_period) {
#' Takes the running standard deviation of a column vector
#'
#' Creates a new column vector whose entries are the standard deviation of the previous lag_period entries.
#' When fewer than lag_period observations are available, the deviation is taken over all entries seen
#' so far (for example, if lag_period = 10, then the 2nd entry of the output vector is the standard
#' deviation of the first two values, the 3rd entry is the standard deviation of the first three
#' values, and so forth).
#'
#' @param my_vec the column vector to take the standard deviation of
#' @param lag_period the number of days one wishes to average over
#'
#' @return A vector whose entries represent the standard deviation of the previous lag_period entries in my_vec
# Error handling
if(is.vector(my_vec) == FALSE){
stop("Not Vector: First argument of simple_moving_deviation must be a vector")
}
if(is.numeric(my_vec[1]) == FALSE){
stop("Non-numeric Entries: values of vector in first argument must be numeric.")
}
if(is.numeric(lag_period) == FALSE || lag_period != round(lag_period)){
stop("Not Integer: Second argument of simple_moving_deviation must be an integer larger than or equal to 2")
}
if(lag_period <= 1){
stop("Not Large Enough: Second argument of simple_moving_deviation must be an integer larger than or equal to 2")
}
# return variable
output_vec = c()
# Setting the first entry to 0 and beginning the loop at 2 avoids taking
# the standard deviation of a single value (which would be NA) without
# adding an additional if-else branch in the loop
output_vec[1] = 0
for (i in 2:length(my_vec)) {
# If there are fewer than lag_period data points prior to the current date,
# simply take the standard deviation over all days so far to get the
# closest thing to a running deviation
if (i <= lag_period){
output_vec[i] = sd(my_vec[1:i])
}
else {
output_vec[i] = sd(my_vec[(i-lag_period+1):i])
}
}
return(output_vec)
}
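As a quick sanity check on the rolling-deviation logic, the toy snippet below (base R, hypothetical data) computes the same full-window quantity directly with sd, which is what the function above returns once i exceeds lag_period:

```r
x <- c(10, 12, 11, 15, 14, 13, 16, 18)
n <- 3
# Full-window rolling standard deviation over the last n observations
roll_sd <- sapply(n:length(x), function(i) sd(x[(i - n + 1):i]))
# First window is x[1:3] = (10, 12, 11), whose sample sd is exactly 1
roll_sd[1]
```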
# Compute the simple moving deviation of AMD
AMD$AMD.SMD_cl_1W <- simple_moving_deviation(AMD$AMD.Close, 5)
AMD$AMD.SMD_cl_2W <- simple_moving_deviation(AMD$AMD.Close, 10)
AMD$AMD.SMD_cl_1M <- simple_moving_deviation(AMD$AMD.Close, 20)
AMD$AMD.SMD_cl_2M <- simple_moving_deviation(AMD$AMD.Close, 40)
# Compute the simple moving deviation of NVDA
NVDA$NVDA.SMD_cl_1W <- simple_moving_deviation(NVDA$NVDA.Close, 5)
NVDA$NVDA.SMD_cl_2W <- simple_moving_deviation(NVDA$NVDA.Close, 10)
NVDA$NVDA.SMD_cl_1M <- simple_moving_deviation(NVDA$NVDA.Close, 20)
NVDA$NVDA.SMD_cl_2M <- simple_moving_deviation(NVDA$NVDA.Close, 40)
# Compute the simple moving deviation of INTC
INTC$INTC.SMD_cl_1W <- simple_moving_deviation(INTC$INTC.Close, 5)
INTC$INTC.SMD_cl_2W <- simple_moving_deviation(INTC$INTC.Close, 10)
INTC$INTC.SMD_cl_1M <- simple_moving_deviation(INTC$INTC.Close, 20)
INTC$INTC.SMD_cl_2M <- simple_moving_deviation(INTC$INTC.Close, 40)
# Compute the simple moving deviation of TSM
TSM$TSM.SMD_cl_1W <- simple_moving_deviation(TSM$TSM.Close, 5)
TSM$TSM.SMD_cl_2W <- simple_moving_deviation(TSM$TSM.Close, 10)
TSM$TSM.SMD_cl_1M <- simple_moving_deviation(TSM$TSM.Close, 20)
TSM$TSM.SMD_cl_2M <- simple_moving_deviation(TSM$TSM.Close, 40)
# Plot the simple moving deviation for our four stocks
ggplot(data = AMD, aes(x=AMD.Date)) +
geom_line(aes(y = AMD.SMD_cl_1W, color = '1-Week')) +
geom_line(aes(y = AMD.SMD_cl_2W, color = '2-Week')) +
geom_line(aes(y = AMD.SMD_cl_1M, color = '1-Month')) +
geom_line(aes(y = AMD.SMD_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(AMD) Simple Moving Deviation") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = NVDA, aes(x=NVDA.Date)) +
geom_line(aes(y = NVDA.SMD_cl_1W, color = '1-Week')) +
geom_line(aes(y = NVDA.SMD_cl_2W, color = '2-Week')) +
geom_line(aes(y = NVDA.SMD_cl_1M, color = '1-Month')) +
geom_line(aes(y = NVDA.SMD_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(NVDA) Simple Moving Deviation") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = INTC, aes(x=INTC.Date)) +
geom_line(aes(y = INTC.SMD_cl_1W, color = '1-Week')) +
geom_line(aes(y = INTC.SMD_cl_2W, color = '2-Week')) +
geom_line(aes(y = INTC.SMD_cl_1M, color = '1-Month')) +
geom_line(aes(y = INTC.SMD_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(INTC) Simple Moving Deviation") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = TSM, aes(x=TSM.Date)) +
geom_line(aes(y = TSM.SMD_cl_1W, color = '1-Week')) +
geom_line(aes(y = TSM.SMD_cl_2W, color = '2-Week')) +
geom_line(aes(y = TSM.SMD_cl_1M, color = '1-Month')) +
geom_line(aes(y = TSM.SMD_cl_2M, color = '2-Month')) +
ylab('USD') +
scale_color_manual(values = c(
'1-Week' = 'firebrick1',
'2-Week' = 'chocolate1',
'1-Month' = 'gold',
'2-Month' = 'chartreuse'
)) +
xlab('Date') +
ggtitle("(TSM) Simple Moving Deviation") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
Ultimately, plotting our simple moving deviation without the relevant simple moving averages is not terribly insightful; as one would expect, there is more variation across longer periods than there is across shorter periods.
The last metric we wish to consider in this analysis is closely related to the crossing of SMA curves in the subsection above: the Moving Average Convergence Divergence (MACD) measures the gap between a short-term and a long-term exponential moving average, as the EMA carries greater information about the recent behavior of the data. In conventional stock market analysis the MACD is computed from a 12-day and a 26-day EMA; here we use the closely related 2-week (10 trading days) and 1-month (20 trading days) EMAs already computed above, and define \[ \operatorname{MACD}(X_i) = \operatorname{EMA}(X_i,\ 10) - \operatorname{EMA}(X_i,\ 20) \] In the general study of time series, the MACD is just one example of a curve-crossing metric between moving averages used to predict trend changes; any pair of short-term and long-term periods could be used to define a similar metric. In the graphs below, a curve crossing the horizontal axis acts as a trade signal to investors to either purchase or short the security, depending on whether the curve is crossing into the upper or lower half-plane.
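A minimal sketch of this definition on synthetic data (base R; the recursive EMA below assumes the smoothing factor \(\alpha = 2/(n+1)\), which may differ from the convention used earlier): for a steadily rising series the short EMA tracks recent, higher values more closely than the long EMA, so the MACD ends positive.

```r
# Recursive exponential moving average with smoothing alpha = 2/(n+1)
ema <- function(x, n) {
  a <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in 2:length(x)) out[i] <- a * x[i] + (1 - a) * out[i - 1]
  out
}

# Steadily rising toy series: the short EMA sits above the long EMA
x <- seq(100, 150, length.out = 60)
macd <- ema(x, 10) - ema(x, 20)
tail(macd, 1)  # positive, signalling an uptrend
```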
# Compute the MACD for our 4 stocks
AMD$AMD.MACD = (AMD$AMD.EMA_cl_2W - AMD$AMD.EMA_cl_1M)
NVDA$NVDA.MACD = (NVDA$NVDA.EMA_cl_2W - NVDA$NVDA.EMA_cl_1M)
INTC$INTC.MACD = (INTC$INTC.EMA_cl_2W - INTC$INTC.EMA_cl_1M)
TSM$TSM.MACD = (TSM$TSM.EMA_cl_2W - TSM$TSM.EMA_cl_1M)
# Plot the MACD line
ggplot(data = AMD, aes(x=AMD.Date)) +
geom_line(aes(y = AMD.MACD, color = 'MACD')) +
ylab('USD') +
scale_color_manual(values = c(
'MACD' = 'aquamarine'
)) +
xlab('Date') +
ggtitle("AMD MACD Line") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = NVDA, aes(x=NVDA.Date)) +
geom_line(aes(y = NVDA.MACD, color = 'MACD')) +
ylab('USD') +
scale_color_manual(values = c(
'MACD' = 'aquamarine'
)) +
xlab('Date') +
ggtitle("NVDA MACD Line") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = INTC, aes(x=INTC.Date)) +
geom_line(aes(y = INTC.MACD, color = 'MACD')) +
ylab('USD') +
scale_color_manual(values = c(
'MACD' = 'aquamarine'
)) +
xlab('Date') +
ggtitle("INTC MACD Line") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
ggplot(data = TSM, aes(x=TSM.Date)) +
geom_line(aes(y = TSM.MACD, color = 'MACD')) +
ylab('USD') +
scale_color_manual(values = c(
'MACD' = 'aquamarine'
)) +
xlab('Date') +
ggtitle("TSM MACD Line") +
scale_y_continuous( labels = label_comma()) +
theme_dark()
Before setting up our statistical learning models, it is crucial to understand whether there are any underlying relationships between the initial predictors or the added predictors we have defined in the previous section. Based on our prior discussion, it is natural to expect that our lag variables are all heavily correlated with each other, with the correlation only slightly decreasing as the lag period grows; in fact, the same should be true of the moving averages as the lag period increases. Since each of our four stocks has the same predictors, we only provide a complete correlation plot for one of the stocks (AMD) to confirm the above hypothesis that the added predictors (aside from MACD) are heavily correlated with the closing price; for the other three stocks, we simply examine the remaining predictors.
select(AMD, where(is.numeric)) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, number.cex=0.3, title="AMD Correlation Plot")
select(NVDA, NVDA.Close, NVDA.Volume, NVDA.SMD_cl_1W, NVDA.SMD_cl_2W, NVDA.SMD_cl_1M, NVDA.SMD_cl_2M, NVDA.MACD) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="NVDA Correlation Plot" )
select(INTC, INTC.Close, INTC.Volume, INTC.SMD_cl_1W, INTC.SMD_cl_2W, INTC.SMD_cl_1M, INTC.SMD_cl_2M, INTC.MACD) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="INTC Correlation Plot")
select(TSM, TSM.Close, TSM.Volume, TSM.SMD_cl_1W, TSM.SMD_cl_2W, TSM.SMD_cl_1M, TSM.SMD_cl_2M, TSM.MACD) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="TSM Correlation Plot")
From the above plots we are able to confirm that there is in fact a
correlation between the closing price and each of the lag variables
which decreases with the lag period, as expected. In addition, we see
that all of our initial variables aside from volume (i.e. Low, High,
Open, Close, and Adjusted) are highly correlated with one another;
to an extent, this tells us that these price variables carry largely
redundant information. Lastly, we see a somewhat unexpected
correlation between the 1-week simple moving deviation of the closing
price and the volume of securities sold on a given day for each of the
four stocks.
With a better picture in mind of how our stock prices can be measured from both the given and derived metrics, we can now resample our data and begin training our models. This will be done in several steps: first we prepare the data to ensure that our models do not become over-fitted to a particular subset, and then we specify which predictors (and what preprocessing) each model will use.
One of the primary ways we ensure robustness of our models is by partitioning our data into training and testing sets. Foremost, this ensures that our model does not become overfit to the details and noise of the underlying data-set, by setting aside a portion of the data which is unseen during the training phase (i.e. the testing data). Ultimately, one would want the outcome variable to have similar statistics and variance across both the training and testing sets; this is accomplished by stratifying our split on the desired outcome variable.
# Create 70-30 split for AMD data
AMD_split <- initial_split(AMD, prop = 0.7, strata = AMD.Close)
AMD_train <- training(AMD_split)
AMD_test <- testing(AMD_split)
# Create 70-30 split for NVDA data
NVDA_split <- initial_split(NVDA, prop = 0.7, strata = NVDA.Close)
NVDA_train <- training(NVDA_split)
NVDA_test <- testing(NVDA_split)
# Create 70-30 split for INTC data
INTC_split <- initial_split(INTC, prop = 0.7, strata = INTC.Close)
INTC_train <- training(INTC_split)
INTC_test <- testing(INTC_split)
# Create 70-30 split for TSM data
TSM_split <- initial_split(TSM, prop = 0.7, strata = TSM.Close)
TSM_train <- training(TSM_split)
TSM_test <- testing(TSM_split)
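What stratification accomplishes can be sketched in base R (toy data; initial_split handles this internally): bin the outcome into quartiles and sample 70% within each bin, so that both subsets span the full range of prices.

```r
set.seed(1)
close <- sort(runif(200, 50, 150))  # stand-in for a closing-price column
# Bin the outcome into quartiles
bins <- cut(close, quantile(close, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)
# Sample 70% of the indices within each quartile bin
train_idx <- unlist(lapply(split(seq_along(close), bins),
                           function(ix) sample(ix, floor(0.7 * length(ix)))))
test_idx <- setdiff(seq_along(close), train_idx)
# Both subsets now contain observations from all four quartiles
```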
One of the most important steps in setting up a statistical learning model is telling the model which predictors to include and how to preprocess the data in our training set. Despite the fact that we have a wide range of predictors at our disposal, adding more predictors does not always improve the performance of our model — for example, we have defined multiple versions of the simple moving average metric for each stock, though they may not all be simultaneously useful. Therefore we wish to choose judiciously the predictors on which each of our models will be based.
Despite the high correlation between our lag variables, auto-regression tends to provide successful rudimentary models for predicting market behavior; for example, ARIMA (Autoregressive Integrated Moving Average, also known as Box-Jenkins) is one of the most popular forecasting methods for time-series problems. Consequently, we want to ensure that all of our lag variables are present for the regression algorithms. In addition, we also want to incorporate any common trading signals among investors: for example, since the strategy of investors changes with the sign of the MACD value, this leads to propagated trends which we would like our model to pick up on. Finally, we add all short-term (1-week) metrics of the data; it should be mentioned, however, that the addition of these variables is the least motivated (for example, the EMA is already used in the MACD, and the SMA serves a very similar purpose).
Using the tidymodels package, we are able to specify
which variables are used to predict the outcome and process our data via
the recipe data structure — though it is effectively the
same for each individual stock, we need to repeat this step since the
variable names pulled from loadSymbols do differ.
# Create AMD recipe
AMD_recipe = recipe(AMD.Close ~ AMD.Close_L1 + AMD.Close_L2 + AMD.Close_L3 + AMD.Close_L4 + AMD.Close_L5 +
AMD.Close_L6 + AMD.Close_L7 + AMD.Close_L8 + AMD.Close_L9 + AMD.Close_L10 +
AMD.SMD_cl_1W + AMD.EMA_cl_1W + AMD.SMA_cl_1W + AMD.MACD + AMD.Volume,
data=AMD_train) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
# Create NVDA recipe
NVDA_recipe = recipe(NVDA.Close ~ NVDA.Close_L1 + NVDA.Close_L2 + NVDA.Close_L3 + NVDA.Close_L4 + NVDA.Close_L5 +
NVDA.Close_L6 + NVDA.Close_L7 + NVDA.Close_L8 + NVDA.Close_L9 + NVDA.Close_L10 +
NVDA.SMD_cl_1W + NVDA.SMA_cl_1W + NVDA.EMA_cl_1W + NVDA.MACD + NVDA.Volume,
data=NVDA_train) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
# Create INTC recipe
INTC_recipe = recipe(INTC.Close ~ INTC.Close_L1 + INTC.Close_L2 + INTC.Close_L3 + INTC.Close_L4 +
INTC.Close_L5 + INTC.Close_L6 + INTC.Close_L7 + INTC.Close_L8 + INTC.Close_L9 + INTC.Close_L10 +
INTC.SMD_cl_1W + INTC.SMA_cl_1W + INTC.EMA_cl_1W + INTC.MACD + INTC.Volume,
data=INTC_train) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
# Create TSM recipe
TSM_recipe = recipe(TSM.Close ~ TSM.Close_L1 + TSM.Close_L2 + TSM.Close_L3 + TSM.Close_L4 +
TSM.Close_L5 + TSM.Close_L6 + TSM.Close_L7 + TSM.Close_L8 + TSM.Close_L9 + TSM.Close_L10 +
TSM.SMD_cl_1W + TSM.SMA_cl_1W + TSM.EMA_cl_1W + TSM.MACD + TSM.Volume,
data=TSM_train) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
As each of our datasets contains a relatively small number of observations (N = 250) taken across 2023, it is important to maximize the utility we get out of our data. The purpose of k-fold cross-validation is to ensure that our models do not become overfit to any particular subset of observations, by rotating \(k\) subsets of our data between training and validation roles. Specifically, we split our training data into \(k=10\) (roughly) equal-sized subsets, stratified on our outcome variable to ensure that there is (roughly) equal variance in each subset. Using these 10 subsets, we train each model 10 times, where precisely one subset is taken as the validation data while the remaining 9 serve as training data.
AMD_folds <- vfold_cv(AMD_train, v = 10, strata = AMD.Close)
NVDA_folds <- vfold_cv(NVDA_train, v = 10, strata = NVDA.Close)
INTC_folds <- vfold_cv(INTC_train, v = 10, strata = INTC.Close)
TSM_folds <- vfold_cv(TSM_train, v = 10, strata = TSM.Close)
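The fold bookkeeping can be illustrated in base R (toy fold assignment; vfold_cv additionally stratifies on the outcome): every observation serves as validation exactly once across the ten resamples.

```r
set.seed(42)
n <- 175                                    # roughly 70% of 250 rows
folds <- sample(rep(1:10, length.out = n))  # random fold assignment
# For fold j, rows with folds == j are validation; the rest are training
val_counts <- tabulate(folds, nbins = 10)
# Every row is validated exactly once across the 10 model fits,
# and the folds are of (roughly) equal size
sum(val_counts)
```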
Although our analysis concerns how statistical learning models can predict semiconductor stocks, we have not actually described any particular statistical learning algorithms up to this point — this is where the modelling step comes in. Each of our models corresponds to its own statistical learning algorithm, with hyperparameter tuning distinguishing certain algorithms from others (for example, ridge and lasso regression). For the scope of this project, we will explore the application of seven machine learning models:
Our first and most basic model is that of standard linear regression; if we assume that our data follows a linear trend in terms of the predictors, the goal of linear regression is to find the best coefficients \(\beta_i\) such that our outcome variable \(Y\) can be described by \[ Y = \beta_0 + \sum_{i=1}^p \beta_i \vec{X_i} \] This is typically done by minimizing the residual sum of squares \[ \operatorname{RSS}=\sum_{i=1}^n (y_i - \sum_{j=1}^p \beta_j x_{ij})^2 \] on our sample set. Though somewhat rudimentary compared to some of our other models, statistical learning through linear regression lends itself quite well to mechanistic models whenever there is some sort of linear relationship involved. In fact, (linear) auto-regression is one of the most popular models to fine-tune for time series problems such as stock market predictions. However, linear regression does not always perform as well as some of our other algorithms for empirically-driven problems due to its strong linearity assumption.
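Minimizing the RSS has the well-known closed form \(\hat{\beta} = (X^\top X)^{-1} X^\top y\); the toy check below (base R, simulated data) confirms that solving the normal equations reproduces the coefficients found by lm.

```r
set.seed(7)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(100, sd = 0.1)

X <- cbind(1, x1, x2)                     # design matrix with intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y) # normal equations
fit <- lm(y ~ x1 + x2)                    # same model via lm

max(abs(beta_hat - coef(fit)))            # agreement to numerical precision
```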
############################
# Define Model for Linear Regression
############################
linear_model <- linear_reg() %>%
set_engine("lm")
A modified version of linear regression, ridge regression also attempts to fit sample data in a linear fashion, subject to an additional constraint on the cost function. Instead of considering the usual residual sum of squares, a hyper-parameter \(\lambda\) is introduced which penalizes large coefficients in our formula: \[ \operatorname{RSS}_{\text{ridge}} = \sum_{i=1}^n (y_i - \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \sum_{i=1}^p \beta_i^2 \] It turns out that \(\lambda\) is precisely the Lagrange multiplier of the usual OLS minimization problem, now subject to the constraint \(\sum_{i=1}^p \beta_i^2 < c\) for some \(c > 0\); geometrically this can be visualized as the intersection of our OLS level sets with the \(p\)-dimensional sphere of radius \(\sqrt{c}\):
Ultimately this makes the coefficients in our ridge model harder to interpret, since they are biased towards zero; however, this also prevents overfitting, by preventing the training data from making any of our coefficients too large.
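The shrinkage effect can be verified directly from the ridge closed form \(\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y\) (a simplified sketch on centered, scaled, simulated data with no intercept term):

```r
set.seed(7)
X <- scale(matrix(rnorm(100 * 3), 100, 3))  # centered, scaled predictors
y <- X %*% c(3, -2, 1) + rnorm(100, sd = 0.5)

# Ridge closed form: (X'X + lambda I)^{-1} X'y
ridge_beta <- function(lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

b0   <- ridge_beta(0)    # ordinary least squares (no penalty)
b100 <- ridge_beta(100)  # heavily penalized

# The squared L2 norm of the coefficients shrinks as lambda grows
c(sum(b0^2), sum(b100^2))
```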
############################
# Define Model for Ridge Regression
############################
ridge_model <- linear_reg(mixture = 0,
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
Also known as \(L^1\) regression, lasso regression is very similar to ridge regression in the sense that it modifies the usual cost function in linear regression with a penalty term. The difference (as the name \(L^1\) regression suggests) is that the penalty term is now measured in the norm of the Banach space \(L^1(\mathbb{R}, \mu)\), as opposed to the norm of the Hilbert space \(L^2(\mathbb{R}, \mu)\) for ridge regression. \[ \operatorname{RSS}_{\text{Lasso}} = \sum_{i=1}^n (y_i - \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \sum_{i=1}^p |\beta_i| \]
For those familiar with the theory of metric spaces, optimizing \(\operatorname{RSS}_{\text{Lasso}}\) can be geometrically interpreted as finding where the usual OLS level sets intersect the unit circle with respect to the taxicab metric.
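A standard consequence of the \(L^1\) penalty (for an orthonormal design, and up to the exact parameterization of \(\lambda\)) is that the lasso solution soft-thresholds the OLS coefficients, which is why lasso can zero out predictors entirely; a sketch:

```r
# Soft-thresholding operator: shrink toward zero by lambda, clipping at zero
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b_ols <- c(3.0, -0.4, 1.2, 0.1)  # hypothetical OLS coefficients
soft_threshold(b_ols, 0.5)
# Small coefficients are set exactly to zero; large ones shrink by lambda:
# approximately c(2.5, 0, 0.7, 0)
```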
############################
# Define Model for Lasso Regression
############################
lasso_model <- linear_reg(mixture = 1,
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
Elastic net regularization is effectively the linear interpolation of ridge regression and lasso regression. By introducing a new parameter \(\alpha \in [0, 1]\), we may combine the above two penalty terms via \[ \operatorname{RSS}_{\text{EN}} = \sum_{i=1}^n (y_i - \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \big( \alpha \sum_{i=1}^p |\beta_i| + (1-\alpha) \sum_{i=1}^p \beta_i^2 \big) \] This ultimately allows us to combine the strengths of both lasso and ridge regression over regular regression, such as handling both high-dimensional data and significant collinearity. In addition, with the right grid of hyperparameter values, we can generally expect that elastic net will perform just as well as lasso and ridge (if not better).
############################
# Define Model for Elastic Net
############################
elastic_net_model <- linear_reg(mixture = tune(),
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
Ultimately all of our regression models above assume some sort of functional (i.e. linear) form of a relationship between our predictors and outcome — that is, they are all parametric models. \(k\)-Nearest neighbors is our first example of a non-parametric model, in the sense that it does not make any prior assumptions about the relationship between our outcome and data. This ultimately makes \(k\)-nearest neighbors a much more viable option for predicting empirically-driven processes as well as classification problems; however, it should be noted that \(k\)-nearest neighbors is not commonly used with time-series problems (as there is often a very direct relationship with a predictor and its lag variables).
The idea behind \(k\)-nearest neighbors, however, is exactly what the name suggests: assuming we are able to establish some sort of metric space structure on our underlying data-set, we predict the outcome of a new data point based on the \(k\) nearest labelled values. For classification problems, this is done by assigning the class with the highest number of occurrences among the \(k\) nearest neighbors; for regression, several techniques such as weighted averages can be used.
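For regression, the unweighted version of this idea fits in a few lines of base R (toy one-dimensional data; the kknn engine used later additionally supports distance weighting and multiple predictors):

```r
# Predict a new point as the mean outcome of its k nearest training points
knn_predict <- function(x_train, y_train, x_new, k) {
  d <- abs(x_train - x_new)         # 1-D Euclidean distances
  mean(y_train[order(d)[1:k]])      # unweighted mean of the k nearest
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(1.1, 1.9, 3.2, 9.8, 11.1, 12.0)
# A query near the left cluster averages only that cluster's outcomes
knn_predict(x, y, x_new = 2.1, k = 3)
```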
One immediate observation that can be drawn from the
above picture is that the performance of \(k\)-nearest neighbors seems to depend on
the number \(k\) of neighbors taken
into consideration. We will see in the next sections that this is
absolutely the case, and we will ultimately need to vary \(k\) as a hyperparameter by testing it
against an array of models with varying values of \(k \geq 1\).
############################
# Define Model for k-Nearest Neighbors
############################
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
############################
# Define Model for Random Forest
############################
random_forest_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
############################
# Define Model for Boosted Trees
############################
boosted_trees_model <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
With all seven of our statistical learning algorithms fixed (along
with tune() applied to relevant hyperparameters), we are
now able to load the models and recipes into workflows. Workflows are
objects introduced by the tidymodels package that allow us
to keep track of all pre-processing and modelling simultaneously, and
will be particularly useful in minimizing the number of objects to keep
track of since we will be dealing with seven models for each of our four
stocks.
############################
# Set Up Workflows for Linear Regression
############################
linear_wflow_AMD <- workflow() %>%
add_model(linear_model) %>%
add_recipe(AMD_recipe)
linear_wflow_NVDA <- workflow() %>%
add_model(linear_model) %>%
add_recipe(NVDA_recipe)
linear_wflow_INTC <- workflow() %>%
add_model(linear_model) %>%
add_recipe(INTC_recipe)
linear_wflow_TSM <- workflow() %>%
add_model(linear_model) %>%
add_recipe(TSM_recipe)
############################
# Set Up Workflows for Ridge Regression
############################
ridge_wflow_AMD <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(AMD_recipe)
ridge_wflow_NVDA <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(NVDA_recipe)
ridge_wflow_INTC <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(INTC_recipe)
ridge_wflow_TSM <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(TSM_recipe)
############################
# Set Up Workflows for Lasso Regression
############################
lasso_wflow_AMD <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(AMD_recipe)
lasso_wflow_NVDA <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(NVDA_recipe)
lasso_wflow_INTC <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(INTC_recipe)
lasso_wflow_TSM <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(TSM_recipe)
############################
# Set Up Workflows for Elastic Net
############################
elastic_net_wflow_AMD <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(AMD_recipe)
elastic_net_wflow_NVDA <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(NVDA_recipe)
elastic_net_wflow_INTC <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(INTC_recipe)
elastic_net_wflow_TSM <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(TSM_recipe)
############################
# Set Up Workflows for k-Nearest Neighbors
############################
knn_wflow_AMD <- workflow() %>%
add_model(knn_model) %>%
add_recipe(AMD_recipe)
knn_wflow_NVDA <- workflow() %>%
add_model(knn_model) %>%
add_recipe(NVDA_recipe)
knn_wflow_INTC <- workflow() %>%
add_model(knn_model) %>%
add_recipe(INTC_recipe)
knn_wflow_TSM <- workflow() %>%
add_model(knn_model) %>%
add_recipe(TSM_recipe)
############################
# Set Up Workflows for Random Forest
############################
random_forest_wflow_AMD <- workflow() %>%
add_model(random_forest_model) %>%
add_recipe(AMD_recipe)
random_forest_wflow_NVDA <- workflow() %>%
add_model(random_forest_model) %>%
add_recipe(NVDA_recipe)
random_forest_wflow_INTC <- workflow() %>%
add_model(random_forest_model) %>%
add_recipe(INTC_recipe)
random_forest_wflow_TSM <- workflow() %>%
add_model(random_forest_model) %>%
add_recipe(TSM_recipe)
############################
# Set Up Workflows for Boosted Trees
############################
boosted_trees_wflow_AMD <- workflow() %>%
add_model(boosted_trees_model) %>%
add_recipe(AMD_recipe)
boosted_trees_wflow_NVDA <- workflow() %>%
add_model(boosted_trees_model) %>%
add_recipe(NVDA_recipe)
boosted_trees_wflow_INTC <- workflow() %>%
add_model(boosted_trees_model) %>%
add_recipe(INTC_recipe)
boosted_trees_wflow_TSM <- workflow() %>%
add_model(boosted_trees_model) %>%
add_recipe(TSM_recipe)
The fact that the majority of our models have additional parameters (i.e. \(k\) for \(k\)-nearest neighbors, \(\alpha\) for elastic net) is a bit of a double-edged sword: it adds a significant amount of flexibility to our models, but raises the question of which values of these parameters give us the best results. Unfortunately, there is often no universal answer which applies to all data-sets. As a result, finding optimal hyperparameter values often feels like navigating a dark room with our bare hands — we don’t really know where to start, so we blindly go forward until we stumble on something. Though this is a bit of an exaggeration (there are mathematically rigorous heuristics, such as searching near \(k = \sqrt{\# \text{samples}}\) for \(k\)-nearest neighbors), it can certainly feel like the case when a considerable portion of our computational runtime comes from throwing grid search at each model.
The upside of grid-search tuning is that it is quite easy to understand: we naïvely punch in a range of values (within some predetermined interval) for each hyperparameter and see how the model performs. In a certain sense, the difficult part is figuring out what interval gives us enough room for an optimized model without making the computation intractable. For both our penalized regressions and \(k\)-nearest neighbors, we simply iterate over the default values (spaced on a log base-10 scale); for random forest and boosted trees, a bit of trial and error was required to see which hyperparameters led to a decrease in training root mean square error (RMSE).
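The grid-search idea in miniature (base R, toy data; tune_grid performs the analogous search over the CV folds): evaluate a simple \(k\)-nearest-neighbors fit at every \(k\) in a grid and keep the value with the lowest validation RMSE.

```r
set.seed(3)
x <- runif(80, 0, 10)
y <- sin(x) + rnorm(80, sd = 0.2)
tr <- sample(80, 60)             # 60 training rows
va <- setdiff(1:80, tr)          # 20 validation rows

# Validation RMSE of a 1-D k-nearest-neighbors fit for a given k
knn_rmse <- function(k) {
  preds <- sapply(x[va], function(x0)
    mean(y[tr][order(abs(x[tr] - x0))[1:k]]))
  sqrt(mean((y[va] - preds)^2))
}

grid <- 1:20                     # the "grid" of hyperparameter values
rmse <- sapply(grid, knn_rmse)   # evaluate every candidate
best_k <- grid[which.min(rmse)]  # keep the best-performing k
```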
############################
# Define Grid for Ridge AND Lasso Regression
############################
lambda_grid <- grid_regular(penalty( range=c(0,1500), trans = identity_trans() ),
levels = 100)
############################
# Define Grid for Elastic Net
############################
elastic_net_grid <- grid_regular(penalty(range=c(0,1500), trans = identity_trans()), mixture(),
levels = 21)
############################
# Define Model for k-Nearest Neighbors
############################
knn_grid <- grid_regular(neighbors(range=c(1,145)),
levels = 145)
############################
# Define Model for Random Forest
############################
random_forest_grid <- grid_regular(mtry(range = c(1, 14)),
trees(range = c(200,700)),
min_n(range = c(1,14)),
levels = 5)
############################
# Define Grid for Boosted Trees
############################
boosted_trees_grid <- grid_regular(mtry(range = c(1, 8)),
                                   trees(range = c(200, 700)),
                                   learn_rate(range = c(-5, -0.5)),
                                   levels = 5)
The most computationally-intensive part of this entire analysis is
searching for the optimal hyperparameter values within the grids we have
defined above. In order to run this computation as few times as
possible, we save our results to 24 different .rda files
(six models across four stocks) using the built-in save
function, then read them back in via load on subsequent runs.
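The save/load caching pattern described above can be wrapped into a small helper — a sketch using only base R, where the function name cache_result and the internal object name are our own (hypothetical) choices, not part of the analysis code:

```r
# Compute-once helper: if a cached .rda already exists, load it;
# otherwise evaluate the (expensive) expression and cache the result.
# R's lazy evaluation means `expr` is never run when the cache is hit.
cache_result <- function(name, expr, dir = "tuned_grids") {
  path <- file.path(dir, paste0(name, ".rda"))
  if (file.exists(path)) {
    load(path)  # restores an object named `result` into this frame
  } else {
    result <- expr
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)
    save(result, file = path)
  }
  result
}
```

With such a helper, each tuning chunk below could read, e.g., `ridge_tune_AMD <- cache_result("ridge_tune_AMD", tune_grid(ridge_wflow_AMD, resamples = AMD_folds, grid = lambda_grid))`.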
############################
# Optimize parameters for Ridge Regression
############################
ridge_tune_AMD <- tune_grid(
ridge_wflow_AMD,
resamples = AMD_folds,
grid = lambda_grid
)
ridge_tune_NVDA <- tune_grid(
ridge_wflow_NVDA,
resamples = NVDA_folds,
grid = lambda_grid
)
ridge_tune_INTC <- tune_grid(
ridge_wflow_INTC,
resamples = INTC_folds,
grid = lambda_grid
)
ridge_tune_TSM <- tune_grid(
ridge_wflow_TSM,
resamples = TSM_folds,
grid = lambda_grid
)
save(ridge_tune_AMD, file = "tuned_grids/ridge_tune_AMD.rda")
save(ridge_tune_NVDA, file = "tuned_grids/ridge_tune_NVDA.rda")
save(ridge_tune_INTC, file = "tuned_grids/ridge_tune_INTC.rda")
save(ridge_tune_TSM, file = "tuned_grids/ridge_tune_TSM.rda")
############################
# Optimize parameters for Lasso Regression
############################
lasso_tune_AMD <- tune_grid(
lasso_wflow_AMD ,
resamples = AMD_folds,
grid = lambda_grid
)
lasso_tune_NVDA <- tune_grid(
lasso_wflow_NVDA ,
resamples = NVDA_folds,
grid = lambda_grid
)
lasso_tune_INTC <- tune_grid(
lasso_wflow_INTC ,
resamples = INTC_folds,
grid = lambda_grid
)
lasso_tune_TSM <- tune_grid(
lasso_wflow_TSM ,
resamples = TSM_folds,
grid = lambda_grid
)
save(lasso_tune_AMD, file = "tuned_grids/lasso_tune_AMD.rda")
save(lasso_tune_NVDA, file = "tuned_grids/lasso_tune_NVDA.rda")
save(lasso_tune_INTC, file = "tuned_grids/lasso_tune_INTC.rda")
save(lasso_tune_TSM, file = "tuned_grids/lasso_tune_TSM.rda")
############################
# Optimize parameters for Elastic Net
############################
elastic_net_tune_AMD <- tune_grid(
elastic_net_wflow_AMD ,
resamples = AMD_folds,
grid = elastic_net_grid
)
elastic_net_tune_NVDA <- tune_grid(
elastic_net_wflow_NVDA ,
resamples = NVDA_folds,
grid = elastic_net_grid
)
elastic_net_tune_INTC <- tune_grid(
elastic_net_wflow_INTC ,
resamples = INTC_folds,
grid = elastic_net_grid
)
elastic_net_tune_TSM <- tune_grid(
elastic_net_wflow_TSM ,
resamples = TSM_folds,
grid = elastic_net_grid
)
save(elastic_net_tune_AMD, file = "tuned_grids/elastic_net_tune_AMD.rda")
save(elastic_net_tune_NVDA, file = "tuned_grids/elastic_net_tune_NVDA.rda")
save(elastic_net_tune_INTC, file = "tuned_grids/elastic_net_tune_INTC.rda")
save(elastic_net_tune_TSM, file = "tuned_grids/elastic_net_tune_TSM.rda")
############################
# Optimize parameters for k-Nearest Neighbors
############################
knn_tune_AMD <- tune_grid(
knn_wflow_AMD ,
resamples = AMD_folds,
grid = knn_grid
)
knn_tune_NVDA <- tune_grid(
knn_wflow_NVDA ,
resamples = NVDA_folds,
grid = knn_grid
)
knn_tune_INTC <- tune_grid(
knn_wflow_INTC ,
resamples = INTC_folds,
grid = knn_grid
)
knn_tune_TSM <- tune_grid(
knn_wflow_TSM ,
resamples = TSM_folds,
grid = knn_grid
)
save(knn_tune_AMD, file = "tuned_grids/knn_tune_AMD.rda")
save(knn_tune_NVDA, file = "tuned_grids/knn_tune_NVDA.rda")
save(knn_tune_INTC, file = "tuned_grids/knn_tune_INTC.rda")
save(knn_tune_TSM, file = "tuned_grids/knn_tune_TSM.rda")
############################
# Optimize parameters for Random Forest
############################
random_forest_tune_AMD <- tune_grid(
random_forest_wflow_AMD,
resamples = AMD_folds,
grid = random_forest_grid
)
random_forest_tune_NVDA <- tune_grid(
random_forest_wflow_NVDA,
resamples = NVDA_folds,
grid = random_forest_grid
)
random_forest_tune_INTC <- tune_grid(
random_forest_wflow_INTC,
resamples = INTC_folds,
grid = random_forest_grid
)
random_forest_tune_TSM <- tune_grid(
random_forest_wflow_TSM,
resamples = TSM_folds,
grid = random_forest_grid
)
save(random_forest_tune_AMD, file = "tuned_grids/random_forest_tune_AMD.rda")
save(random_forest_tune_NVDA, file = "tuned_grids/random_forest_tune_NVDA.rda")
save(random_forest_tune_INTC, file = "tuned_grids/random_forest_tune_INTC.rda")
save(random_forest_tune_TSM, file = "tuned_grids/random_forest_tune_TSM.rda")
############################
# Optimize parameters for Boosted Trees
############################
boosted_trees_tune_AMD <- tune_grid(
boosted_trees_wflow_AMD,
resamples = AMD_folds,
grid = boosted_trees_grid
)
boosted_trees_tune_NVDA <- tune_grid(
boosted_trees_wflow_NVDA,
resamples = NVDA_folds,
grid = boosted_trees_grid
)
boosted_trees_tune_INTC <- tune_grid(
boosted_trees_wflow_INTC,
resamples = INTC_folds,
grid = boosted_trees_grid
)
boosted_trees_tune_TSM <- tune_grid(
boosted_trees_wflow_TSM,
resamples = TSM_folds,
grid = boosted_trees_grid
)
save(boosted_trees_tune_AMD, file = "tuned_grids/boosted_trees_tune_AMD.rda")
save(boosted_trees_tune_NVDA, file = "tuned_grids/boosted_trees_tune_NVDA.rda")
save(boosted_trees_tune_INTC, file = "tuned_grids/boosted_trees_tune_INTC.rda")
save(boosted_trees_tune_TSM, file = "tuned_grids/boosted_trees_tune_TSM.rda")
One benefit of running such a broad search is that we gather far more information about our models’ dependence on their hyperparameters than would theoretically be needed to find an optimal model. With this information in hand (for each model across all four stocks), we can visualize how the accuracy of our models changes as we vary the value of these parameters:
# Load tuned Ridge Regression grids
load("tuned_grids/ridge_tune_AMD.rda")
load("tuned_grids/ridge_tune_NVDA.rda")
load("tuned_grids/ridge_tune_INTC.rda")
load("tuned_grids/ridge_tune_TSM.rda")
autoplot(ridge_tune_AMD) + ggtitle("Ridge Regression AMD") + theme_dark()
autoplot(ridge_tune_NVDA) + ggtitle("Ridge Regression NVDA") + theme_dark()
For our ridge regression models, it’s difficult to get an accurate
picture of how the root mean square error (RMSE) depends on the
regularization until the amount of regularization is taken to be
significantly large — in fact, we had to change the range of the
penalty() parameter in our grid from \(0 \leq \lambda \leq 5\) to \(0 \leq \lambda \leq 1500\) to obtain
any noticeable effect for the NVIDIA stock. However, this is likely due
to the fact that the closing price (and thus the regression
coefficients) of NVDA is significantly larger than those of the other three
stocks, meaning a larger value of \(\lambda\) is needed for the \(\lambda \sum_{i=1}^p \beta_i^2\) term to
dominate the modified \(\operatorname{RSS}_{\text{Ridge}}\).
Though widening our grid search for the purpose of visualizing the
trend between regularization and RMSE means the grid search will not
pinpoint the “best model” as precisely, it will ultimately not
affect our process of selecting the best two models for our testing data
(trial and error across a variety of parameters for
grid_regular shows that elastic net and regular linear
regression effectively always perform better).
# Load tuned Lasso Regression grids
load("tuned_grids/lasso_tune_AMD.rda")
load("tuned_grids/lasso_tune_NVDA.rda")
load("tuned_grids/lasso_tune_INTC.rda")
load("tuned_grids/lasso_tune_TSM.rda")
autoplot(lasso_tune_AMD) + ggtitle("Lasso Regression AMD") + theme_dark()
autoplot(lasso_tune_NVDA) + ggtitle("Lasso Regression NVDA") + theme_dark()
While the difference between the penalty terms of lasso and ridge may seem negligible at first, we see from the above plots that the \(L^1\) norm has a significantly larger impact on our regression models than the \(L^2\) norm does. In particular, lasso is significantly more sensitive to increases in the regularization parameter \(\lambda\), and its RMSE approaches a maximal value much more quickly than ridge regression’s does.
# Load tuned Elastic Net grids
load("tuned_grids/elastic_net_tune_AMD.rda")
load("tuned_grids/elastic_net_tune_NVDA.rda")
load("tuned_grids/elastic_net_tune_INTC.rda")
load("tuned_grids/elastic_net_tune_TSM.rda")
autoplot(elastic_net_tune_AMD) + ggtitle("Elastic Net AMD") + theme_dark()
autoplot(elastic_net_tune_NVDA) + ggtitle("Elastic Net NVDA") + theme_dark()
We are also now able to visualize quite plainly how elastic net
interpolates between the penalty terms of ridge and lasso regression. For
example, the blue/purple/pink curves displayed on the NVDA plot behave
essentially like lasso regression, with the RMSE quickly
increasing around \(\lambda = 100\)
toward the same asymptotic value.
# Load tuned k-Nearest Neighbors grids
load("tuned_grids/knn_tune_AMD.rda")
load("tuned_grids/knn_tune_NVDA.rda")
load("tuned_grids/knn_tune_INTC.rda")
load("tuned_grids/knn_tune_TSM.rda")
autoplot(knn_tune_AMD) + ggtitle("k-Nearest Neighbors AMD") + theme_dark()
autoplot(knn_tune_NVDA) + ggtitle("k-Nearest Neighbors NVDA") + theme_dark()
For \(k\)-nearest neighbors, we
additionally see that the optimal choices of \(k\) (i.e. the local minima on the graphs
above) do occur somewhat close to the square root of the number
of samples (roughly 15 in this case). However, the positive
relationship between \(k\) and our RMSE
value shows that smaller values of \(k\) are
still favored — in fact, the optimal value of \(k\) lies between \(2 \leq k \leq 6\) for all four of our
stocks.
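Rather than reading the optimal \(k\) off the plots, the top candidates can also be listed directly; a sketch using show_best() from the tune package (shown for the AMD grid, output omitted since it depends on the tuned objects above):

```r
# Top five candidate values of k for the AMD stock, ranked by
# cross-validated RMSE (smallest first).
show_best(knn_tune_AMD, metric = "rmse", n = 5)
```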
# Load tuned Random Forest grids
load("tuned_grids/random_forest_tune_AMD.rda")
load("tuned_grids/random_forest_tune_NVDA.rda")
load("tuned_grids/random_forest_tune_INTC.rda")
load("tuned_grids/random_forest_tune_TSM.rda")
autoplot(random_forest_tune_AMD) + ggtitle("Random Forest AMD") + theme_dark() + theme(text = element_text(size = 9))
autoplot(random_forest_tune_NVDA) + ggtitle("Random Forest NVDA") + theme_dark() + theme(text = element_text(size = 9))
# Load tuned Boosted Trees grids
load("tuned_grids/boosted_trees_tune_AMD.rda")
load("tuned_grids/boosted_trees_tune_NVDA.rda")
load("tuned_grids/boosted_trees_tune_INTC.rda")
load("tuned_grids/boosted_trees_tune_TSM.rda")
autoplot(boosted_trees_tune_AMD) + theme_dark() + ggtitle("Boosted Trees AMD") + theme(text = element_text(size = 9))
autoplot(boosted_trees_tune_NVDA) + theme_dark() + ggtitle("Boosted Trees NVDA") + theme(text = element_text(size = 9))
Though having a vast amount of data about the relationship between
our hyperparameters and training accuracy is useful for better
understanding each model, we ultimately only care about which exact
hyperparameter values give us the best results (according to some chosen
metric). Using the select_best() function from the
tidymodels package, we need only specify the metric we care
most about in order to extract the optimal model for each of our seven
statistical learning algorithms, thus concluding the hyperparameter
tuning portion of this analysis.
############################
# Select best model(s) for Ridge Regression
############################
ridge_final_wflow_AMD <- select_best(ridge_tune_AMD , metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_AMD )
ridge_final_wflow_NVDA <- select_best(ridge_tune_NVDA , metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_NVDA )
ridge_final_wflow_INTC <- select_best(ridge_tune_INTC , metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_INTC )
ridge_final_wflow_TSM <- select_best(ridge_tune_TSM , metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_TSM )
############################
# Select best model(s) for Lasso Regression
############################
lasso_final_wflow_AMD <- select_best(lasso_tune_AMD , metric="rmse") %>%
finalize_workflow(x=lasso_wflow_AMD )
lasso_final_wflow_NVDA <- select_best(lasso_tune_NVDA , metric="rmse") %>%
finalize_workflow(x=lasso_wflow_NVDA )
lasso_final_wflow_INTC <- select_best(lasso_tune_INTC , metric="rmse") %>%
finalize_workflow(x=lasso_wflow_INTC )
lasso_final_wflow_TSM <- select_best(lasso_tune_TSM , metric="rmse") %>%
finalize_workflow(x=lasso_wflow_TSM )
############################
# Select best model(s) for Elastic Net
############################
elastic_net_final_wflow_AMD <- select_best(elastic_net_tune_AMD , metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_AMD )
elastic_net_final_wflow_NVDA <- select_best(elastic_net_tune_NVDA , metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_NVDA )
elastic_net_final_wflow_INTC <- select_best(elastic_net_tune_INTC , metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_INTC )
elastic_net_final_wflow_TSM <- select_best(elastic_net_tune_TSM , metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_TSM )
############################
# Select best model(s) for k-Nearest Neighbors
############################
knn_final_wflow_AMD <- select_best(knn_tune_AMD , metric = "rmse") %>%
finalize_workflow(x=knn_wflow_AMD )
knn_final_wflow_NVDA <- select_best(knn_tune_NVDA , metric = "rmse") %>%
finalize_workflow(x=knn_wflow_NVDA )
knn_final_wflow_INTC <- select_best(knn_tune_INTC , metric = "rmse") %>%
finalize_workflow(x=knn_wflow_INTC )
knn_final_wflow_TSM <- select_best(knn_tune_TSM , metric = "rmse") %>%
finalize_workflow(x=knn_wflow_TSM )
############################
# Select best model(s) for Random Forest
############################
random_forest_final_wflow_AMD <- select_best(random_forest_tune_AMD , metric = "rmse") %>%
finalize_workflow(x=random_forest_wflow_AMD )
random_forest_final_wflow_NVDA <- select_best(random_forest_tune_NVDA , metric = "rmse") %>%
finalize_workflow(x=random_forest_wflow_NVDA )
random_forest_final_wflow_INTC <- select_best(random_forest_tune_INTC , metric = "rmse") %>%
finalize_workflow(x=random_forest_wflow_INTC )
random_forest_final_wflow_TSM <- select_best(random_forest_tune_TSM , metric = "rmse") %>%
finalize_workflow(x=random_forest_wflow_TSM )
############################
# Select best model(s) for Boosted Trees
############################
boosted_trees_final_wflow_AMD <- select_best(boosted_trees_tune_AMD , metric = "rmse") %>%
finalize_workflow(x=boosted_trees_wflow_AMD )
boosted_trees_final_wflow_NVDA <- select_best(boosted_trees_tune_NVDA , metric = "rmse") %>%
finalize_workflow(x=boosted_trees_wflow_NVDA )
boosted_trees_final_wflow_INTC <- select_best(boosted_trees_tune_INTC , metric = "rmse") %>%
finalize_workflow(x=boosted_trees_wflow_INTC )
boosted_trees_final_wflow_TSM <- select_best(boosted_trees_tune_TSM , metric = "rmse") %>%
finalize_workflow(x=boosted_trees_wflow_TSM )
After an arduous process of tuning each of our seven models across all four stocks, we now have an array of workflows that should be well-primed for our entire set of training data (remember that the hyperparameter tuning was done on 10 folds of the training data, so during tuning each model only ever saw \(9/10\) of it at a time). Ultimately, the performance of our models on the entire training set need not give any indication of their performance on unseen testing data; regardless, we will use it as the benchmark for deciding which two models pass on to the testing data.
############################
# Fit model(s) for Linear Regression
############################
linear_fit_AMD <- fit(linear_wflow_AMD, AMD_train)
linear_fit_NVDA <- fit(linear_wflow_NVDA, NVDA_train)
linear_fit_INTC <- fit(linear_wflow_INTC, INTC_train)
linear_fit_TSM <- fit(linear_wflow_TSM, TSM_train)
############################
# Fit model(s) for Ridge Regression
############################
ridge_fit_AMD <- fit(ridge_final_wflow_AMD , AMD_train)
ridge_fit_NVDA <- fit(ridge_final_wflow_NVDA , NVDA_train)
ridge_fit_INTC <- fit(ridge_final_wflow_INTC , INTC_train)
ridge_fit_TSM <- fit(ridge_final_wflow_TSM , TSM_train)
############################
# Fit model(s) for Lasso Regression
############################
lasso_fit_AMD <- fit(lasso_final_wflow_AMD , AMD_train)
lasso_fit_NVDA <- fit(lasso_final_wflow_NVDA , NVDA_train)
lasso_fit_INTC <- fit(lasso_final_wflow_INTC , INTC_train)
lasso_fit_TSM <- fit(lasso_final_wflow_TSM , TSM_train)
############################
# Fit model(s) for Elastic Net
############################
elastic_net_fit_AMD <- fit(elastic_net_final_wflow_AMD , AMD_train)
elastic_net_fit_NVDA <- fit(elastic_net_final_wflow_NVDA , NVDA_train)
elastic_net_fit_INTC <- fit(elastic_net_final_wflow_INTC , INTC_train)
elastic_net_fit_TSM <- fit(elastic_net_final_wflow_TSM , TSM_train)
############################
# Fit model(s) for k-Nearest Neighbors
############################
knn_fit_AMD <- fit(knn_final_wflow_AMD , AMD_train)
knn_fit_NVDA <- fit(knn_final_wflow_NVDA , NVDA_train)
knn_fit_INTC <- fit(knn_final_wflow_INTC , INTC_train)
knn_fit_TSM <- fit(knn_final_wflow_TSM , TSM_train)
############################
# Fit model(s) for Random Forest
############################
random_forest_fit_AMD <- fit(random_forest_final_wflow_AMD , AMD_train)
random_forest_fit_NVDA <- fit(random_forest_final_wflow_NVDA , NVDA_train)
random_forest_fit_TSM <- fit(random_forest_final_wflow_TSM, TSM_train)
random_forest_fit_INTC <- fit(random_forest_final_wflow_INTC, INTC_train)
############################
# Fit model(s) for Boosted Trees
############################
boosted_trees_fit_AMD <- fit(boosted_trees_final_wflow_AMD , AMD_train)
boosted_trees_fit_NVDA <- fit(boosted_trees_final_wflow_NVDA , NVDA_train)
boosted_trees_fit_TSM <- fit(boosted_trees_final_wflow_TSM, TSM_train)
boosted_trees_fit_INTC <- fit(boosted_trees_final_wflow_INTC, INTC_train)
Although this step would have been much more valuable prior to the
recipe creation (though this is a bit of a Catch-22, since a recipe,
model, and fit are needed to obtain the results), the vip
library contains an incredibly useful tool for extracting, from
our tree-based statistical learning algorithms, information about which variables
affect the outcome of our model most heavily. For example, we can plot
the variable importance plots for the AMD stock according to
both the random forest algorithm and the boosted trees algorithm to see
that there is actually a slight difference in terms of which variable is
considered most relevant for predicting stock price:
random_forest_fit_AMD %>% extract_fit_parsnip() %>%
vip()
boosted_trees_fit_AMD %>% extract_fit_parsnip() %>%
vip()
It should be noted that the horizontal axis scaling should be ignored,
since we are comparing the variable importance plots (VIPs) between two
distinct tree algorithms. Though there is a slight difference in
the ordering of the variables, we still see that the 1-day lag variable
(i.e. the information from the previous day) is one of the most
important predictors under both algorithms. This may lead
one to believe that ‘short-term’ lag variables will always take
precedence over ‘long-term’ variables in these time-series based
problems — however, looking at the VIPs for NVDA below, we see that the
first lag variable doesn’t even appear in the top 50% of most important
variables for the random forest algorithm.
random_forest_fit_NVDA %>% extract_fit_parsnip() %>%
vip()
boosted_trees_fit_NVDA %>% extract_fit_parsnip() %>%
vip()
It is also worth considering that the periodicity of the NVDA
stock (i.e. dependence on 6-7 day old data) may simply be an artifact of
our particular sample of data. For example, the important variables for our
INTC and TSM stocks seem much closer to what a typical investor
would choose to scrutinize in order to predict a stock’s closing
price.
random_forest_fit_INTC %>% extract_fit_parsnip() %>%
vip()
boosted_trees_fit_INTC %>% extract_fit_parsnip() %>%
vip()
random_forest_fit_TSM %>% extract_fit_parsnip() %>%
vip()
boosted_trees_fit_TSM %>% extract_fit_parsnip() %>%
vip()
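Beyond the plots, the raw importance scores behind each VIP can be extracted as a data frame via vi() from the same vip package — a sketch assuming the fitted workflows above:

```r
# One row per predictor, with its importance score; useful for
# filtering or comparing importance rankings across the four stocks.
random_forest_fit_AMD %>%
  extract_fit_parsnip() %>%
  vi() %>%
  head()
```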
############################
# Use Linear Regression Fit to Predict Training Data
############################
linear_train_res_AMD <- predict(linear_fit_AMD, new_data = AMD_train %>% select(-AMD.Close)) %>%
bind_cols(AMD_train %>% select(AMD.Close))
linear_train_res_NVDA <- predict(linear_fit_NVDA, new_data = NVDA_train %>% select(-NVDA.Close)) %>%
bind_cols(NVDA_train %>% select(NVDA.Close))
linear_train_res_INTC <- predict(linear_fit_INTC, new_data = INTC_train %>% select(-INTC.Close)) %>%
bind_cols(INTC_train %>% select(INTC.Close))
linear_train_res_TSM <- predict(linear_fit_TSM, new_data = TSM_train %>% select(-TSM.Close)) %>%
bind_cols(TSM_train %>% select(TSM.Close))
############################
# Use Ridge Regression Fit to Predict Training Data
############################
ridge_train_res_AMD <- predict(ridge_fit_AMD , new_data = AMD_train %>% select(-AMD.Close)) %>%
bind_cols(AMD_train %>% select(AMD.Close))
ridge_train_res_NVDA <- predict(ridge_fit_NVDA , new_data = NVDA_train %>% select(-NVDA.Close)) %>%
bind_cols(NVDA_train %>% select(NVDA.Close))
ridge_train_res_INTC <- predict(ridge_fit_INTC , new_data = INTC_train %>% select(-INTC.Close)) %>%
bind_cols(INTC_train %>% select(INTC.Close))
ridge_train_res_TSM <- predict(ridge_fit_TSM , new_data = TSM_train %>% select(-TSM.Close)) %>%
bind_cols(TSM_train %>% select(TSM.Close))
############################
# Use Lasso Regression Fit to Predict Training Data
############################
lasso_train_res_AMD <- predict(lasso_fit_AMD , new_data = AMD_train %>% select(-AMD.Close)) %>%
bind_cols(AMD_train %>% select(AMD.Close))
lasso_train_res_NVDA <- predict(lasso_fit_NVDA , new_data = NVDA_train %>% select(-NVDA.Close)) %>%
bind_cols(NVDA_train %>% select(NVDA.Close))
lasso_train_res_INTC <- predict(lasso_fit_INTC , new_data = INTC_train %>% select(-INTC.Close)) %>%
bind_cols(INTC_train %>% select(INTC.Close))
lasso_train_res_TSM <- predict(lasso_fit_TSM , new_data = TSM_train %>% select(-TSM.Close)) %>%
bind_cols(TSM_train %>% select(TSM.Close))
############################
# Use Elastic Net Fit to Predict Training Data
############################
elastic_net_train_res_AMD <- predict(elastic_net_fit_AMD , new_data = AMD_train %>% select(-AMD.Close )) %>%
bind_cols(AMD_train %>% select(AMD.Close ))
elastic_net_train_res_NVDA <- predict(elastic_net_fit_NVDA , new_data = NVDA_train %>% select(-NVDA.Close )) %>%
bind_cols(NVDA_train %>% select(NVDA.Close ))
elastic_net_train_res_INTC <- predict(elastic_net_fit_INTC , new_data = INTC_train %>% select(-INTC.Close )) %>%
bind_cols(INTC_train %>% select(INTC.Close ))
elastic_net_train_res_TSM <- predict(elastic_net_fit_TSM , new_data = TSM_train %>% select(-TSM.Close )) %>%
bind_cols(TSM_train %>% select(TSM.Close ))
############################
# Use k-Nearest Neighbors Fit to Predict Training Data
############################
knn_train_res_AMD <- predict(knn_fit_AMD , new_data = AMD_train %>% select(-AMD.Close )) %>%
bind_cols(AMD_train %>% select(AMD.Close ))
knn_train_res_NVDA <- predict(knn_fit_NVDA , new_data = NVDA_train %>% select(-NVDA.Close )) %>%
bind_cols(NVDA_train %>% select(NVDA.Close ))
knn_train_res_INTC <- predict(knn_fit_INTC , new_data = INTC_train %>% select(-INTC.Close )) %>%
bind_cols(INTC_train %>% select(INTC.Close ))
knn_train_res_TSM <- predict(knn_fit_TSM , new_data = TSM_train %>% select(-TSM.Close )) %>%
bind_cols(TSM_train %>% select(TSM.Close ))
############################
# Use Random Forest Fit to Predict Training Data
############################
random_forest_train_res_AMD <- predict(random_forest_fit_AMD , new_data = AMD_train %>% select(-AMD.Close )) %>%
bind_cols(AMD_train %>% select(AMD.Close ))
random_forest_train_res_NVDA <- predict(random_forest_fit_NVDA , new_data = NVDA_train %>% select(-NVDA.Close )) %>%
bind_cols(NVDA_train %>% select(NVDA.Close ))
random_forest_train_res_INTC <- predict(random_forest_fit_INTC , new_data = INTC_train %>% select(-INTC.Close )) %>%
bind_cols(INTC_train %>% select(INTC.Close ))
random_forest_train_res_TSM <- predict(random_forest_fit_TSM , new_data = TSM_train %>% select(-TSM.Close )) %>%
bind_cols(TSM_train %>% select(TSM.Close ))
############################
# Use Boosted Trees Fit to Predict Training Data
############################
boosted_trees_train_res_AMD <- predict(boosted_trees_fit_AMD , new_data = AMD_train %>% select(-AMD.Close )) %>%
bind_cols(AMD_train %>% select(AMD.Close ))
boosted_trees_train_res_NVDA <- predict(boosted_trees_fit_NVDA , new_data = NVDA_train %>% select(-NVDA.Close )) %>%
bind_cols(NVDA_train %>% select(NVDA.Close ))
boosted_trees_train_res_INTC <- predict(boosted_trees_fit_INTC , new_data = INTC_train %>% select(-INTC.Close )) %>%
bind_cols(INTC_train %>% select(INTC.Close ))
boosted_trees_train_res_TSM <- predict(boosted_trees_fit_TSM , new_data = TSM_train %>% select(-TSM.Close )) %>%
bind_cols(TSM_train %>% select(TSM.Close ))
Root Mean Square Error (RMSE) results:
tibble(Model = c("Linear Regression", "Ridge Regression", "Lasso Regression", "Elastic Net", "k-Nearest Neighbors", "Random Forest", "Boosted Trees"),
AMD = c((linear_train_res_AMD %>% rmse( AMD.Close, .pred))$.estimate,
(ridge_train_res_AMD %>% rmse( AMD.Close, .pred))$.estimate,
(lasso_train_res_AMD %>% rmse( AMD.Close, .pred))$.estimate,
(elastic_net_train_res_AMD %>% rmse( AMD.Close, .pred))$.estimate,
(knn_train_res_AMD %>% rmse( AMD.Close, .pred))$.estimate,
(random_forest_train_res_AMD %>% rmse( AMD.Close, .pred))$.estimate,
(boosted_trees_train_res_AMD %>% rmse( AMD.Close, .pred))$.estimate),
NVDA = c((linear_train_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate,
(ridge_train_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate,
(lasso_train_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate,
(elastic_net_train_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate,
(knn_train_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate,
(random_forest_train_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate,
(boosted_trees_train_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate),
INTC = c((linear_train_res_INTC %>% rmse( INTC.Close, .pred))$.estimate,
(ridge_train_res_INTC %>% rmse( INTC.Close, .pred))$.estimate,
(lasso_train_res_INTC %>% rmse( INTC.Close, .pred))$.estimate,
(elastic_net_train_res_INTC %>% rmse( INTC.Close, .pred))$.estimate,
(knn_train_res_INTC %>% rmse( INTC.Close, .pred))$.estimate,
(random_forest_train_res_INTC %>% rmse( INTC.Close, .pred))$.estimate,
(boosted_trees_train_res_INTC %>% rmse( INTC.Close, .pred))$.estimate),
TSM = c((linear_train_res_TSM %>% rmse( TSM.Close, .pred))$.estimate,
(ridge_train_res_TSM %>% rmse( TSM.Close, .pred))$.estimate,
(lasso_train_res_TSM %>% rmse( TSM.Close, .pred))$.estimate,
(elastic_net_train_res_TSM %>% rmse( TSM.Close, .pred))$.estimate,
(knn_train_res_TSM %>% rmse( TSM.Close, .pred))$.estimate,
(random_forest_train_res_TSM %>% rmse( TSM.Close, .pred))$.estimate,
(boosted_trees_train_res_TSM %>% rmse( TSM.Close, .pred))$.estimate)
) %>%
kable(caption = "RMSE Performance on Training Data") %>%
kable_styling(full_width = F)
| Model | AMD | NVDA | INTC | TSM |
|---|---|---|---|---|
| Linear Regression | 0.0544838 | 0.2403406 | 0.0000000 | 0.0979569 |
| Ridge Regression | 2.7796559 | 12.0599632 | 0.9053357 | 1.4343570 |
| Lasso Regression | 0.5306895 | 8.1497551 | 0.1686717 | 0.2011261 |
| Elastic Net | 0.5358922 | 5.0289858 | 0.1741111 | 0.2008612 |
| k-Nearest Neighbors | 1.9343982 | 2.0383613 | 0.3699107 | 0.4881917 |
| Random Forest | 1.0434124 | 3.6729720 | 0.3150509 | 0.5336988 |
| Boosted Trees | 0.0489457 | 0.1008268 | 0.0126148 | 0.0213412 |
\(R^2\) results:
tibble(Model = c("Linear Regression", "Ridge Regression", "Lasso Regression", "Elastic Net", "k-Nearest Neighbors", "Random Forest", "Boosted Trees"),
AMD = c((linear_train_res_AMD %>% rsq( AMD.Close, .pred))$.estimate,
(ridge_train_res_AMD %>% rsq( AMD.Close, .pred))$.estimate,
(lasso_train_res_AMD %>% rsq( AMD.Close, .pred))$.estimate,
(elastic_net_train_res_AMD %>% rsq( AMD.Close, .pred))$.estimate,
(knn_train_res_AMD %>% rsq( AMD.Close, .pred))$.estimate,
(random_forest_train_res_AMD %>% rsq( AMD.Close, .pred))$.estimate,
(boosted_trees_train_res_AMD %>% rsq( AMD.Close, .pred))$.estimate),
NVDA = c((linear_train_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate,
(ridge_train_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate,
(lasso_train_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate,
(elastic_net_train_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate,
(knn_train_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate,
(random_forest_train_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate,
(boosted_trees_train_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate),
INTC = c((linear_train_res_INTC %>% rsq( INTC.Close, .pred))$.estimate,
(ridge_train_res_INTC %>% rsq( INTC.Close, .pred))$.estimate,
(lasso_train_res_INTC %>% rsq( INTC.Close, .pred))$.estimate,
(elastic_net_train_res_INTC %>% rsq( INTC.Close, .pred))$.estimate,
(knn_train_res_INTC %>% rsq( INTC.Close, .pred))$.estimate,
(random_forest_train_res_INTC %>% rsq( INTC.Close, .pred))$.estimate,
(boosted_trees_train_res_INTC %>% rsq( INTC.Close, .pred))$.estimate),
TSM = c((linear_train_res_TSM %>% rsq( TSM.Close, .pred))$.estimate,
(ridge_train_res_TSM %>% rsq( TSM.Close, .pred))$.estimate,
(lasso_train_res_TSM %>% rsq( TSM.Close, .pred))$.estimate,
(elastic_net_train_res_TSM %>% rsq( TSM.Close, .pred))$.estimate,
(knn_train_res_TSM %>% rsq( TSM.Close, .pred))$.estimate,
(random_forest_train_res_TSM %>% rsq( TSM.Close, .pred))$.estimate,
(boosted_trees_train_res_TSM %>% rsq( TSM.Close, .pred))$.estimate)
) %>%
kable(caption="R^2 Performance on Training Data") %>%
kable_styling(full_width = F)
| Model | AMD | NVDA | INTC | TSM |
|---|---|---|---|---|
| Linear Regression | 0.9999906 | 0.9999950 | 1.0000000 | 0.9997671 |
| Ridge Regression | 0.9758765 | 0.9876447 | 0.9738857 | 0.9508820 |
| Lasso Regression | 0.9991184 | 0.9943210 | 0.9990852 | 0.9990228 |
| Elastic Net | 0.9991018 | 0.9978331 | 0.9990259 | 0.9990253 |
| k-Nearest Neighbors | 0.9889815 | 0.9996457 | 0.9956677 | 0.9943419 |
| Random Forest | 0.9966188 | 0.9988508 | 0.9969941 | 0.9933018 |
| Boosted Trees | 0.9999925 | 0.9999991 | 0.9999949 | 0.9999891 |
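As an aside, the two training-metric tables above were assembled with one rmse()/rsq() call per model-stock pair; a more compact (and here untested) sketch would store the result tibbles in a named list and iterate with purrr, using yardstick’s vectorized rmse_vec() — shown for two of the seven models:

```r
library(purrr)

# Named list of per-stock result tibbles for each model (only two of
# the seven models shown; the rest follow the same pattern).
train_res <- list(
  "Linear Regression" = list(AMD  = linear_train_res_AMD,
                             NVDA = linear_train_res_NVDA,
                             INTC = linear_train_res_INTC,
                             TSM  = linear_train_res_TSM),
  "Boosted Trees"     = list(AMD  = boosted_trees_train_res_AMD,
                             NVDA = boosted_trees_train_res_NVDA,
                             INTC = boosted_trees_train_res_INTC,
                             TSM  = boosted_trees_train_res_TSM)
)

# Build the RMSE table row by row: one row per model, one column per stock.
rmse_table <- imap_dfr(train_res, function(by_stock, model_name) {
  scores <- imap(by_stock, function(res, stock) {
    rmse_vec(truth = res[[paste0(stock, ".Close")]], estimate = res$.pred)
  })
  as_tibble(c(list(Model = model_name), scores))
})
```

Swapping rmse_vec for rsq_vec would reproduce the \(R^2\) table in the same way.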
############################
# Use Linear Regression Fit to Predict Testing Data
############################
linear_test_res_AMD <- predict(linear_fit_AMD, new_data = AMD_test %>% select(-AMD.Close)) %>%
bind_cols(AMD_test %>% select(AMD.Close))
linear_test_res_NVDA <- predict(linear_fit_NVDA, new_data = NVDA_test %>% select(-NVDA.Close)) %>%
bind_cols(NVDA_test %>% select(NVDA.Close))
linear_test_res_INTC <- predict(linear_fit_INTC, new_data = INTC_test %>% select(-INTC.Close)) %>%
bind_cols(INTC_test %>% select(INTC.Close))
linear_test_res_TSM <- predict(linear_fit_TSM, new_data = TSM_test %>% select(-TSM.Close)) %>%
bind_cols(TSM_test %>% select(TSM.Close))
############################
# Use Boosted Trees Fit to Predict Testing Data
############################
boosted_trees_test_res_AMD <- predict(boosted_trees_fit_AMD , new_data = AMD_test %>% select(-AMD.Close )) %>%
bind_cols(AMD_test %>% select(AMD.Close ))
boosted_trees_test_res_NVDA <- predict(boosted_trees_fit_NVDA , new_data = NVDA_test %>% select(-NVDA.Close )) %>%
bind_cols(NVDA_test %>% select(NVDA.Close ))
boosted_trees_test_res_INTC <- predict(boosted_trees_fit_INTC , new_data = INTC_test %>% select(-INTC.Close )) %>%
bind_cols(INTC_test %>% select(INTC.Close ))
boosted_trees_test_res_TSM <- predict(boosted_trees_fit_TSM , new_data = TSM_test %>% select(-TSM.Close )) %>%
bind_cols(TSM_test %>% select(TSM.Close ))
Root Mean Square Error (RMSE) results:
tibble(Model = c("Linear Regression", "Boosted Trees"),
AMD = c((linear_test_res_AMD %>% rmse( AMD.Close, .pred))$.estimate,
(boosted_trees_test_res_AMD %>% rmse( AMD.Close, .pred))$.estimate),
NVDA = c((linear_test_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate,
(boosted_trees_test_res_NVDA %>% rmse( NVDA.Close, .pred))$.estimate),
INTC = c((linear_test_res_INTC %>% rmse( INTC.Close, .pred))$.estimate,
(boosted_trees_test_res_INTC %>% rmse( INTC.Close, .pred))$.estimate),
TSM = c((linear_test_res_TSM %>% rmse( TSM.Close, .pred))$.estimate,
(boosted_trees_test_res_TSM %>% rmse( TSM.Close, .pred))$.estimate)
) %>%
kable(caption = "RMSE Performance on Testing Data") %>%
kable_styling(full_width = F)
| Model | AMD | NVDA | INTC | TSM |
|---|---|---|---|---|
| Linear Regression | 0.1358323 | 0.0167839 | 0.1227342 | 0.0243335 |
| Boosted Trees | 2.5445759 | 7.9287952 | 0.8221076 | 1.7683678 |
\(R^2\) results:
tibble(Model = c("Linear Regression", "Boosted Trees"),
AMD = c((linear_test_res_AMD %>% rsq( AMD.Close, .pred))$.estimate,
(boosted_trees_test_res_AMD %>% rsq( AMD.Close, .pred))$.estimate),
NVDA = c((linear_test_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate,
(boosted_trees_test_res_NVDA %>% rsq( NVDA.Close, .pred))$.estimate),
INTC = c((linear_test_res_INTC %>% rsq( INTC.Close, .pred))$.estimate,
(boosted_trees_test_res_INTC %>% rsq( INTC.Close, .pred))$.estimate),
TSM = c((linear_test_res_TSM %>% rsq( TSM.Close, .pred))$.estimate,
(boosted_trees_test_res_TSM %>% rsq( TSM.Close, .pred))$.estimate)
) %>%
kable(caption="R^2 Performance on Testing Data") %>%
kable_styling(full_width = F)
| Model | AMD | NVDA | INTC | TSM |
|---|---|---|---|---|
| Linear Regression | 0.9999486 | 1.0000000 | 0.9994401 | 0.9999875 |
| Boosted Trees | 0.9797945 | 0.9949534 | 0.9741646 | 0.9358073 |